除了使用String.replaceAll()方法并逐个替换字母之外,还有更好的方法来摆脱重音并使这些字母规则吗? 例子:

输入:或者čpžsíáýd

输出:orcpzsiayd

它不需要包括所有有口音的字母,比如俄语字母或汉语字母。


当前回答

从2011年开始,你可以使用Apache Commons stringutils . stripaccent (input)(从3.0开始):

    String input = StringUtils.stripAccents("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ");
    System.out.println(input);
    // Prints "This is a funky String"

注意:

接受的答案(Erick Robertson的)对于Ø或Ł不适用。Apache Commons 3.5也不能用于Ø,但它可以用于Ł。在阅读了维基百科上关于Ø的文章后,我不确定它是否应该被“O”取代:它在挪威语和丹麦语中是一个单独的字母,在“z”之后按字母顺序排列。这是“条形强调”方法局限性的一个很好的例子。

其他回答

@virgo47的解决方案非常快,但很接近。接受的答案使用Normalizer和正则表达式。我想知道Normalizer和正则表达式占用了多少时间,因为删除所有非ascii字符可以在没有正则表达式的情况下完成:

import java.text.Normalizer;

public class Strip {
    public static String flattenToAscii(String string) {
        StringBuilder sb = new StringBuilder(string.length());
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        for (char c : string.toCharArray()) {
            if (c <= '\u007F') sb.append(c);
        }
        return sb.toString();
    }
}

小的额外加速可以通过写入char[]而不调用toCharArray()来获得,尽管我不确定代码清晰度的降低是否值得这样做:

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    int j = 0;
    for (int i = 0, n = string.length(); i < n; ++i) {
        char c = string.charAt(i);
        if (c <= '\u007F') out[j++] = c;
    }
    return new String(out);
}

这种变化具有使用Normalizer的正确性和使用表的一些速度方面的优点。在我的机器上,这个答案比公认的答案快4倍,比@virgo47的答案慢6.6倍到7倍(公认的答案比我机器上的@virgo47的答案慢26倍)。

@David Conrad solution is the fastest I tried using the Normalizer, but it does have a bug. It basically strips characters which are not accents, for example Chinese characters and other letters like æ, are all stripped. The characters that we want to strip are non spacing marks, characters which don't take up extra width in the final string. These zero width characters basically end up combined in some other character. If you can see them isolated as a character, for example like this `, my guess is that it's combined with the space character.

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    String norm = Normalizer.normalize(string, Normalizer.Form.NFD);

    int j = 0;
    for (int i = 0, n = norm.length(); i < n; ++i) {
        char c = norm.charAt(i);
        int type = Character.getType(c);

        //Log.d(TAG,""+c);
        //by Ricardo, modified the character check for accents, ref: http://stackoverflow.com/a/5697575/689223
        if (type != Character.NON_SPACING_MARK){
            out[j] = c;
            j++;
        }
    }
    //Log.d(TAG,"normalized string:"+norm+"/"+new String(out));
    return new String(out);
}

如果有人在kotlin中很难做到这一点,这段代码就像一个魅力。为了避免不一致,我也使用. touppercase和Trim()。然后我强制转换这个函数:

   fun stripAccents(s: String):String{

   if (s == null) {
      return "";
   }

val chars: CharArray = s.toCharArray()

var sb = StringBuilder(s)
var cont: Int = 0

while (chars.size > cont) {
    var c: kotlin.Char
    c = chars[cont]
    var c2:String = c.toString()
   //these are my needs, in case you need to convert other accents just Add new entries aqui
    c2 = c2.replace("Ã", "A")
    c2 = c2.replace("Õ", "O")
    c2 = c2.replace("Ç", "C")
    c2 = c2.replace("Á", "A")
    c2 = c2.replace("Ó", "O")
    c2 = c2.replace("Ê", "E")
    c2 = c2.replace("É", "E")
    c2 = c2.replace("Ú", "U")

    c = c2.single()
    sb.setCharAt(cont, c)
    cont++

}

return sb.toString()

}

要像这样使用这些有趣的转换代码:

     var str: String
     str = editText.text.toString() //get the text from EditText
     str = str.toUpperCase().trim()

     str = stripAccents(str) //call the function

编辑:如果你不困于Java <6,速度不是关键,/或翻译表太有限,请使用David的回答。重点是使用Normalizer(在Java 6中引入),而不是在循环中使用转换表。

虽然这不是“完美”的解决方案,但当你知道范围(在我们的例子中是latin1,2)时,它工作得很好,在Java 6之前工作(虽然不是一个真正的问题),并且比大多数建议的版本快得多(可能是也可能不是一个问题):

    /**
 * Mirror of the unicode table from 00c0 to 017f without diacritics.
 */
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
    "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
    "aaaaaaaceeeeiiii" +
    "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +
    "UuUuWwYyYZzZzZzF";

/**
 * Returns string without diacritics - 7 bit approximation.
 *
 * @param source string to convert
 * @return corresponding string without diacritics
 */
public static String removeDiacritic(String source) {
    char[] vysl = new char[source.length()];
    char one;
    for (int i = 0; i < source.length(); i++) {
        one = source.charAt(i);
        if (one >= '\u00c0' && one <= '\u017f') {
            one = tab00c0.charAt((int) one - '\u00c0');
        }
        vysl[i] = one;
    }
    return new String(vysl);
}

在我使用32位JDK的HW上进行的测试表明,这在~100ms内执行了从àèéľšťč89FDČ到aeelstc89FDC的100万次转换,而Normalizer方式使其在3.7s(慢37倍)。如果您的需求与性能有关,并且您知道输入范围,那么这可能适合您。

喜欢:-)

我推荐Junidecode。它不仅可以处理'Ł'和'Ø',而且还可以很好地从其他字母(如汉语)转录成拉丁字母。