除了使用String.replaceAll()方法并逐个替换字母之外,还有更好的方法来摆脱重音并使这些字母规则吗? 例子:

输入:或者čpžsíáýd

输出:orcpzsiayd

它不需要包括所有有口音的字母,比如俄语字母或汉语字母。


当前回答

一种快速安全的方式

public static String removeDiacritics(String str) {
    if (str == null)
        return null;
    if (str.isEmpty())
        return "";
    
    int len = str.length();
    StringBuilder sb
        = new StringBuilder(len);
    
    //iterate string codepoints
    for (int i = 0; i < len; ) {
        int codePoint = str.codePointAt(i);
        int charCount
            = Character.charCount(codePoint);
        
        if (charCount > 1) {
            for (int j = 0; j < charCount; j++)
                sb.append(str.charAt(i + j));
            i += charCount;
            continue;
        }
        else if (codePoint <= 127) {
            sb.append((char)codePoint);
            i++;
            continue;
        }
        
        sb.append(
            java.text.Normalizer
                .normalize(
                    Character.toString((char)codePoint),
                    java.text.Normalizer.Form.NFD)
                        .charAt(0));
        i++;
    }
    
    return sb.toString();
}

其他回答

因为这个解决方案已经在Maven资源库的stringutils . striptones()中可用,并且可以在@DavidS提到的Ł中使用。 但我需要这是工作在Ø和Ł所以修改如下。可能对其他人也有帮助。

更新


这是StringUtils的修改版本。stripaccent (String obj),它包含旧的功能,同时处理Ø和Ł字符。

public static String stripAccents(final String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    for (int i = 0; i < decomposed.length(); i++) {
        if (decomposed.charAt(i) == '\u0141') {
            decomposed.setCharAt(i, 'L');
        } else if (decomposed.charAt(i) == '\u0142') {
            decomposed.setCharAt(i, 'l');
        }else if (decomposed.charAt(i) == '\u00D8') {
            decomposed.setCharAt(i, 'O');
        }else if (decomposed.charAt(i) == '\u00F8') {
            decomposed.setCharAt(i, 'o');
        }
    }
    // Note that this doesn't correctly remove ligatures...
    return Pattern.compile("\\p{InCombiningDiacriticalMarks}+").matcher(decomposed).replaceAll("");
}

输入字符串Ł Tĥïŝ 这是一个时髦的字符串O O

@David Conrad solution is the fastest I tried using the Normalizer, but it does have a bug. It basically strips characters which are not accents, for example Chinese characters and other letters like æ, are all stripped. The characters that we want to strip are non spacing marks, characters which don't take up extra width in the final string. These zero width characters basically end up combined in some other character. If you can see them isolated as a character, for example like this `, my guess is that it's combined with the space character.

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    String norm = Normalizer.normalize(string, Normalizer.Form.NFD);

    int j = 0;
    for (int i = 0, n = norm.length(); i < n; ++i) {
        char c = norm.charAt(i);
        int type = Character.getType(c);

        //Log.d(TAG,""+c);
        //by Ricardo, modified the character check for accents, ref: http://stackoverflow.com/a/5697575/689223
        if (type != Character.NON_SPACING_MARK){
            out[j] = c;
            j++;
        }
    }
    //Log.d(TAG,"normalized string:"+norm+"/"+new String(out));
    return new String(out);
}

从java.text.Normalizer开始。

string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatible" deconstruction 

这将从大多数字符中分离所有重音符号。然后,你只需要将每个字符与字母进行比较,并排除那些不是字母的字符。

string = string.replaceAll("[^\\p{ASCII}]", "");

如果你的文本是Unicode,你应该使用这个:

string = string.replaceAll("\\p{M}", "");

对于Unicode, \\P{M}匹配基本字形,\\P{M}(小写)匹配每个重音。

感谢GarretWilson提供的指针和正则表达式.info提供的Unicode指南。


需要注意的是,Normalizer本身不足以删除变音符。例如,下面的代码不会将重音的é替换为不重音的e:

import static java.text.Normalizer.normalize;
import static java.text.Normalizer.Form.*;

public class T {
  public static void main( final String[] args ) {
    final var text = "Brévis";

    System.out.println(
      normalize( text, NFD ) + " " + 
      normalize( text, NFC ) + " " + 
      normalize( text, NFKD ) + " " + 
      normalize( text, NFKC )
    );
  }
}

我推荐Junidecode。它不仅可以处理'Ł'和'Ø',而且还可以很好地从其他字母(如汉语)转录成拉丁字母。

一种快速安全的方式

public static String removeDiacritics(String str) {
    if (str == null)
        return null;
    if (str.isEmpty())
        return "";
    
    int len = str.length();
    StringBuilder sb
        = new StringBuilder(len);
    
    //iterate string codepoints
    for (int i = 0; i < len; ) {
        int codePoint = str.codePointAt(i);
        int charCount
            = Character.charCount(codePoint);
        
        if (charCount > 1) {
            for (int j = 0; j < charCount; j++)
                sb.append(str.charAt(i + j));
            i += charCount;
            continue;
        }
        else if (codePoint <= 127) {
            sb.append((char)codePoint);
            i++;
            continue;
        }
        
        sb.append(
            java.text.Normalizer
                .normalize(
                    Character.toString((char)codePoint),
                    java.text.Normalizer.Form.NFD)
                        .charAt(0));
        i++;
    }
    
    return sb.toString();
}