如何从字符串中删除重音字符? 特别是在IE6中,我有这样的东西:

accentsTidy = function(s){
    var r=s.toLowerCase();
    r = r.replace(new RegExp(/\s/g),"");
    r = r.replace(new RegExp(/[àáâãäå]/g),"a");
    r = r.replace(new RegExp(/æ/g),"ae");
    r = r.replace(new RegExp(/ç/g),"c");
    r = r.replace(new RegExp(/[èéêë]/g),"e");
    r = r.replace(new RegExp(/[ìíîï]/g),"i");
    r = r.replace(new RegExp(/ñ/g),"n");                
    r = r.replace(new RegExp(/[òóôõö]/g),"o");
    r = r.replace(new RegExp(/œ/g),"oe");
    r = r.replace(new RegExp(/[ùúûü]/g),"u");
    r = r.replace(new RegExp(/[ýÿ]/g),"y");
    r = r.replace(new RegExp(/\W/g),"");
    return r;
};

但是IE6让我很烦,它好像不喜欢我的正则表达式。


当前回答

通过给定的测试,一个解决方案似乎要快得多:http://jsperf.com/diacritics/9

function removeDiacritics(str) {
   return str.replace(/[^A-Za-z0-9\s]+/g, function(a){
      return diacriticsMap[a] || a; 
   });
}
removeDiacritics(teste);

工作示例:http://jsbin.com/sovorute/1/edit

推理:这样做速度更快的一个原因是,我们只遍历由负正则表达式模式选择的特殊字符。最快的测试(不带in的字符串迭代)在给定文本上迭代1001,这意味着每个字符。这个函数只迭代了35次,输出了相同的结果。请记住,这将只替换地图中所指示的内容。

关于这个主题的经典文章:http://alistapart.com/article/accent-folding-for-auto-complete

来源:http://semplicewebsites.com/removing-accents-javascript,也提供了一个不错的人物地图。

其他回答


function removeAccents(strAccents){
    strAccents = strAccents.split('');
    strAccentsOut = new Array();
    strAccentsLen = strAccents.length;
    var accents = 'ÀÁÂÃÄÅàáâãäåÒÓÔÕÕÖØòóôõöøÈÉÊËèéêëðÇçÐÌÍÎÏìíîïÙÚÛÜùúûüÑñŠšŸÿýŽž';
    var accentsOut = ['A','A','A','A','A','A','a','a','a','a','a','a','O','O','O','O','O','O','O','o','o','o','o','o','o','E','E','E','E','e','e','e','e','e','C','c','D','I','I','I','I','i','i','i','i','U','U','U','U','u','u','u','u','N','n','S','s','Y','y','y','Z','z'];
    for (var y = 0; y < strAccentsLen; y++) {
        if (accents.indexOf(strAccents[y]) != -1) {
            strAccentsOut[y] = accentsOut[accents.indexOf(strAccents[y])];
        }
        else
            strAccentsOut[y] = strAccents[y];
    }
    strAccentsOut = strAccentsOut.join('');
    return strAccentsOut;
}

谢谢大家 我使用这个版本并说明原因(因为我在开始时错过了这些解释,所以我试图帮助下一个读者,如果他和我一样无聊…)

备注:我想要一个高效的解决方案,所以:

只有一次regexp编译(如果需要) 每个字符串只扫描一个字符串 查找已翻译字符的有效方法 等等……

我的版本是: (里面没有新的技术技巧,只有一些精选的+解释)

makeSortString = (function() {
    var translate_re = /[¹²³áàâãäåaaaÀÁÂÃÄÅAAAÆccç©CCÇÐÐèéê?ëeeeeeÈÊË?EEEEE€gGiìíîïìiiiÌÍÎÏ?ÌIIIlLnnñNNÑòóôõöoooøÒÓÔÕÖOOOØŒr®Ršs?ߊS?ùúûüuuuuÙÚÛÜUUUUýÿÝŸžzzŽZZ]/g;
    var translate = {
"¹":"1","²":"2","³":"3","á":"a","à":"a","â":"a","ã":"a","ä":"a","å":"a","a":"a","a":"a","a":"a","À":"a","Á":"a","Â":"a","Ã":"a","Ä":"a","Å":"a","A":"a","A":"a",
"A":"a","Æ":"a","c":"c","c":"c","ç":"c","©":"c","C":"c","C":"c","Ç":"c","Ð":"d","Ð":"d","è":"e","é":"e","ê":"e","?":"e","ë":"e","e":"e","e":"e","e":"e","e":"e",
"e":"e","È":"e","Ê":"e","Ë":"e","?":"e","E":"e","E":"e","E":"e","E":"e","E":"e","€":"e","g":"g","G":"g","i":"i","ì":"i","í":"i","î":"i","ï":"i","ì":"i","i":"i",
"i":"i","i":"i","Ì":"i","Í":"i","Î":"i","Ï":"i","?":"i","Ì":"i","I":"i","I":"i","I":"i","l":"l","L":"l","n":"n","n":"n","ñ":"n","N":"n","N":"n","Ñ":"n","ò":"o",
"ó":"o","ô":"o","õ":"o","ö":"o","o":"o","o":"o","o":"o","ø":"o","Ò":"o","Ó":"o","Ô":"o","Õ":"o","Ö":"o","O":"o","O":"o","O":"o","Ø":"o","Œ":"o","r":"r","®":"r",
"R":"r","š":"s","s":"s","?":"s","ß":"s","Š":"s","S":"s","?":"s","ù":"u","ú":"u","û":"u","ü":"u","u":"u","u":"u","u":"u","u":"u","Ù":"u","Ú":"u","Û":"u","Ü":"u",
"U":"u","U":"u","U":"u","U":"u","ý":"y","ÿ":"y","Ý":"y","Ÿ":"y","ž":"z","z":"z","z":"z","Ž":"z","Z":"z","Z":"z"
    };
    return function(s) {
        return(s.replace(translate_re, function(match){return translate[match];}) );
    }
})();

我是这样用的:

var without_accents = makeSortString("wïthêüÄTrèsBïgüeAk100t");
// I let you guess the result,
// no I was kidding you : I give you the result : witheuatresbigueak100t

评论:

Tthe instruction inside it is done once (after, makeSortString != undefined) function(){...} is stored once in makeSortString, so the "big" translate_re and translate objects are stored once When you call makeSortString('something') it call directly the inside function which calls only s.replace(...) : it is efficient s.replace uses regexp (the special syntax of var translate_re= .... is in fact equivalent to var translate_re = new RegExp("[¹....Z]","g"); but the compilation of the regexp is done once for all, and the scan of the s String is done one for a call of the function (not for every character as it would be in a loop) For each character found s.replace calls function(match) where parameter match contains the character found, and it call the corresponding translated character (translate[match]) Translate[match] is probably efficient too as the javascript translate object is probably implemented by javascript with a hashtab or something equivalent and allow the program to find the translated character almost directly and not for instance through a loop on a array of all characters to find the right one (which would be awfully unefficient).

在NPM中有一个包:latinize

这是解决这个问题的一个很好的方案。

使用ES2015/ES6 String.prototype.normalize(),

const str = "Crème Brulée"
str.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
> "Creme Brulee"

注意:如果你想让\uFB01(fi)标准化(到fi),请使用NFKD。

这里发生了两件事:

Unicode标准格式将组合的字形分解为简单字形的组合。Crème的è最后表示为e + +。 使用正则表达式字符类来匹配U+0300→U+036F范围,现在可以很容易地全局消除变音符,Unicode标准将其方便地分组为组合变音符标记Unicode块。

从2021年开始,还可以使用Unicode属性转义:

str.normalize("NFD").replace(/\p{Diacritic}/gu, "")

有关性能测试,请参阅注释。

或者,如果你只是想排序

Intl。Collator有足够的支持~95%现在,polyfill也可以在这里,但我还没有测试它。

const c = new Intl.Collator();
["creme brulee", "crème brulée", "crame brulai", "crome brouillé",
"creme brulay", "creme brulfé", "creme bruléa"].sort(c.compare)
["crame brulai", "creme brulay", "creme bruléa", "creme brulee",
"crème brulée", "creme brulfé", "crome brouillé"]


["creme brulee", "crème brulée", "crame brulai", "crome brouillé"].sort((a,b) => a>b)
["crame brulai", "creme brulee", "crome brouillé", "crème brulée"]

您可以使用Lodash库中的_.deburr()方法。

它可以作为一个独立的NPM包lodash.deburr,也可以作为lodash包的一部分。

const myStringWithAccent = 'Mon café est plein de caféïne';
const myStringWithoutAccent = _.deburr( myStringWithAccent, );

结果就是:“我的咖啡里装满了咖啡因”