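Is there a better way to get rid of accents and turn those letters into regular ones, apart from using String.replaceAll() and replacing the letters one by one? Example:

Input: orčpžsíáýd

Output: orcpzsiayd

It doesn't need to cover all letters with accents, such as those of the Russian alphabet or the Chinese one.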


Current answer

Faced with the same problem, here is a solution using a Kotlin extension property:

import java.text.Normalizer

// Decompose to NFD, then remove the combining diacritical marks.
val String.stripAccents: String
    get() = Regex("\\p{InCombiningDiacriticalMarks}+")
        .replace(
            Normalizer.normalize(this, Normalizer.Form.NFD),
            ""
        )

Usage

val textWithoutAccents = "some accented string".stripAccents
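For readers working in Java rather than Kotlin, the same idea can be sketched roughly as follows. The class name `AccentStripper` and the method name are my own choices for illustration; only `Normalizer` and `Pattern` come from the JDK.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption made for this sketch.
final class AccentStripper {
    // Compiled once, because compiling a regex on every call is wasteful.
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    private AccentStripper() {}

    static String stripAccents(String input) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove the combining marks and keep the base letters.
        return MARKS.matcher(decomposed).replaceAll("");
    }
}
```

Calling `AccentStripper.stripAccents("orčpžsíáýd")` should give back `orcpzsiayd`, the output asked for in the question.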

Other answers

@David Conrad's solution is the fastest one I tried that uses the Normalizer, but it does have a bug: it strips characters that are not accents. For example, Chinese characters and letters like æ are all removed. The characters we want to strip are non-spacing marks, characters which don't take up extra width in the final string. These zero-width characters end up combined with some other character. If you can see one isolated as a character, for example like this `, my guess is that it's combined with the space character.

public static String flattenToAscii(String string) {
    String norm = Normalizer.normalize(string, Normalizer.Form.NFD);
    // Size the buffer from the normalized string: NFD output can be longer than the input.
    char[] out = new char[norm.length()];

    int j = 0;
    for (int i = 0, n = norm.length(); i < n; ++i) {
        char c = norm.charAt(i);
        int type = Character.getType(c);

        //Log.d(TAG,""+c);
        //by Ricardo, modified the character check for accents, ref: http://stackoverflow.com/a/5697575/689223
        if (type != Character.NON_SPACING_MARK){
            out[j] = c;
            j++;
        }
    }
    //Log.d(TAG,"normalized string:"+norm+"/"+new String(out));
    // Return only the j characters written, so no trailing '\u0000' padding leaks out.
    return new String(out, 0, j);
}
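To make the difference between the two filters concrete, here is a small comparison sketch. The class and method names are made up for the example, and it iterates over code points rather than chars, which is a variation of my own and not part of the answer above.

```java
import java.text.Normalizer;

// Hypothetical demo class; names are assumptions made for this sketch.
class MarkFilterDemo {
    // Drops only non-spacing marks, so unaccented non-ASCII letters survive.
    static String stripMarks(String s) {
        String norm = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(norm.length());
        norm.codePoints()
            .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // Drops everything outside ASCII, including letters that carry no accent at all.
    static String asciiOnly(String s) {
        String norm = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(norm.length());
        for (int i = 0; i < norm.length(); i++) {
            char c = norm.charAt(i);
            if (c <= '\u007F') sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "\u010D\u00E6\u4E2D";     // č, æ, 中
        System.out.println(stripMarks(input));   // prints c, æ, 中 (only the caron is removed)
        System.out.println(asciiOnly(input));    // prints c (æ and 中 are lost)
    }
}
```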

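Depending on the language, those might not be considered accents (which change the sound of the letter), but diacritical marks.

https://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics

"Bosnian and Croatian have the symbols č, ć, đ, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."

Removing them might inherently change the meaning of the word, or change the letters into completely different ones.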

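The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered how much of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex: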

import java.text.Normalizer;

public class Strip {
    public static String flattenToAscii(String string) {
        StringBuilder sb = new StringBuilder(string.length());
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        for (char c : string.toCharArray()) {
            if (c <= '\u007F') sb.append(c);
        }
        return sb.toString();
    }
}

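A small additional speed-up can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure the decrease in code clarity merits it: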

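public static String flattenToAscii(String string) {
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    // Allocate after normalizing, so the buffer matches the decomposed length.
    char[] out = new char[string.length()];
    int j = 0;
    for (int i = 0, n = string.length(); i < n; ++i) {
        char c = string.charAt(i);
        if (c <= '\u007F') out[j++] = c;
    }
    // Return only the j characters kept; new String(out) would append '\u0000' padding.
    return new String(out, 0, j);
}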

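This variation has the correctness of the version using Normalizer and some of the speed of the version using a table. On my machine, it is about 4x faster than the accepted answer and 6.6x to 7x slower than @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).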
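The table-based answer is not reproduced here, so as a rough illustration of what "using a table" can mean, here is a sketch of my own (not @virgo47's code). It assumes only precomposed characters below U+0180 need handling, and it builds the table once with Normalizer, so the per-string work is a plain array lookup. The class name and the table size are assumptions.

```java
import java.text.Normalizer;

// Hypothetical table-based stripper; an illustration, not code from any answer.
class TableStrip {
    // Covers Basic Latin, Latin-1 Supplement and Latin Extended-A (an assumed cut-off).
    private static final char[] TABLE = new char[0x0180];

    static {
        for (char c = 0; c < TABLE.length; c++) {
            String d = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            char base = d.charAt(0);
            // Map to the base letter only when it is ASCII; otherwise keep the character.
            TABLE[c] = (base <= 0x7F) ? base : c;
        }
    }

    static String strip(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) {
            char c = out[i];
            if (c < TABLE.length) out[i] = TABLE[c];
        }
        return new String(out);
    }
}
```

The trade-off is the approximation mentioned above: characters outside the table pass through unchanged, and input that already contains separate combining marks is not cleaned up.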

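This solution is already available as StringUtils.stripAccents() on the Maven Repository, and it handles Ł, as @DavidS mentioned. But I needed it to work for both Ø and Ł, so I modified it as below. It may be helpful for others too.

Update

This is a modified version of StringUtils.stripAccents(String obj) that keeps the old functionality while also handling the Ø and Ł characters.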

import java.text.Normalizer;
import java.util.regex.Pattern;

public static String stripAccents(final String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    // Ł/ł and Ø/ø are left untouched by NFD, so map them to plain letters by hand.
    for (int i = 0; i < decomposed.length(); i++) {
        if (decomposed.charAt(i) == '\u0141') {
            decomposed.setCharAt(i, 'L');
        } else if (decomposed.charAt(i) == '\u0142') {
            decomposed.setCharAt(i, 'l');
        } else if (decomposed.charAt(i) == '\u00D8') {
            decomposed.setCharAt(i, 'O');
        } else if (decomposed.charAt(i) == '\u00F8') {
            decomposed.setCharAt(i, 'o');
        }
    }
    // Note that this doesn't correctly remove ligatures...
    return Pattern.compile("\\p{InCombiningDiacriticalMarks}+").matcher(decomposed).replaceAll("");
}

Input string: Ł Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Ø ø

Output string: L This is a funky String O o
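The reason Ł and Ø need the manual mapping is that, unlike č or á, they have no canonical decomposition, so NFD leaves them as they are and there is no combining mark to strip. A quick way to check this, using a helper name I made up for the example:

```java
import java.text.Normalizer;

// Hypothetical check; the class and method names are assumptions for this sketch.
class DecompositionCheck {
    // True when NFD rewrites the string, i.e. something in it decomposes.
    static boolean changedByNfd(String s) {
        return !Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    public static void main(String[] args) {
        System.out.println(changedByNfd("\u010D")); // č: true, becomes c + combining caron
        System.out.println(changedByNfd("\u0141")); // Ł: false, no decomposition
        System.out.println(changedByNfd("\u00D8")); // Ø: false, no decomposition
    }
}
```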

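I recommend Junidecode. It handles not only 'Ł' and 'Ø', but also does a good job of transliterating other scripts, such as Chinese, into the Latin alphabet.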