除了使用String.replaceAll()方法并逐个替换字母之外,还有更好的方法来摆脱重音并使这些字母规则吗? 例子:

输入:或者čpžsíáýd

输出:orcpzsiayd

它不需要包括所有有口音的字母,比如俄语字母或汉语字母。


当前回答

如果你没有库,使用regex和Normalizer的最好方法之一是:

    public String flattenToAscii(String s) {
                if(s == null || s.trim().length() == 0)
                        return "";
                return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");
}

这比replaceAll("[^\p{ASCII}]", ""))更有效,而且如果你不需要变音符符(就像你的例子一样)。

否则,您必须使用p{ASCII}模式。

的问候。

其他回答

System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));

为我工作。上面代码段的输出给出了“aee”,这是我想要的,但是

System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));

没有做任何替换。

因为这个解决方案已经在Maven资源库的stringutils . striptones()中可用,并且可以在@DavidS提到的Ł中使用。 但我需要这是工作在Ø和Ł所以修改如下。可能对其他人也有帮助。

更新


这是StringUtils的修改版本。stripaccent (String obj),它包含旧的功能,同时处理Ø和Ł字符。

public static String stripAccents(final String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    for (int i = 0; i < decomposed.length(); i++) {
        if (decomposed.charAt(i) == '\u0141') {
            decomposed.setCharAt(i, 'L');
        } else if (decomposed.charAt(i) == '\u0142') {
            decomposed.setCharAt(i, 'l');
        }else if (decomposed.charAt(i) == '\u00D8') {
            decomposed.setCharAt(i, 'O');
        }else if (decomposed.charAt(i) == '\u00F8') {
            decomposed.setCharAt(i, 'o');
        }
    }
    // Note that this doesn't correctly remove ligatures...
    return Pattern.compile("\\p{InCombiningDiacriticalMarks}+").matcher(decomposed).replaceAll("");
}

输入字符串Ł Tĥïŝ 这是一个时髦的字符串O O

我也遇到过与字符串相等性检查相关的相同问题,比较字符串中的一个 ASCII字符码128-255。

i.e., Non-breaking space - [Hex - A0] Space [Hex - 20]. To show Non-breaking space over HTML. I have used the following spacing entities. Their character and its bytes are like &emsp is very wide space[ ]{-30, -128, -125}, &ensp is somewhat wide space[ ]{-30, -128, -126}, &thinsp is narrow space[ ]{32} , Non HTML Space {} String s1 = "My Sample Space Data", s2 = "My Sample Space Data"; System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes())); System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes())); Output in Bytes: S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97] S2: [77, 121, -30, -128, -125, 83, 97, 109, 112, 108, 101, -30, -128, -125, 83, 112, 97, 99, 101, -30, -128, -125, 68, 97, 116, 97]

对于不同的空格及其字节码使用下面的代码:wiki for List_of_Unicode_characters

String spacing_entities = "very wide space,narrow space,regular space,invisible separator";
System.out.println("Space String :"+ spacing_entities);
byte[] byteArray = 
    // spacing_entities.getBytes( Charset.forName("UTF-8") );
    // Charset.forName("UTF-8").encode( s2 ).array();
    {-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
System.out.println("Bytes:"+ Arrays.toString( byteArray ) );
try {
    System.out.format("Bytes to String[%S] \n ", new String(byteArray, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

➩ ASCII transliterations of Unicode string for Java. unidecode String initials = Unidecode.decode( s2 ); ➩ using Guava: Google Core Libraries for Java. String replaceFrom = CharMatcher.WHITESPACE.replaceFrom( s2, " " ); For URL encode for the space use Guava laibrary. String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString); ➩ To overcome this problem used String.replaceAll() with some RegularExpression. // \p{Z} or \p{Separator}: any kind of whitespace or invisible separator. s2 = s2.replaceAll("\\p{Zs}", " "); s2 = s2.replaceAll("[^\\p{ASCII}]", " "); s2 = s2.replaceAll(" ", " "); ➩ Using java.text.Normalizer.Form. This enum provides constants of the four Unicode normalization forms that are described in Unicode Standard Annex #15 — Unicode Normalization Forms and two methods to access them. s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC);


测试字符串和输出的不同方法,如➩Unidecode, Normalizer, StringUtils。

String strUni = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß";

// This is a funky String AE,O,D,ss
String initials = Unidecode.decode( strUni );

// Following Produce this o/p: Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß
String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
temp = pattern.matcher(temp).replaceAll("");

String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );

使用unidcode是最好的选择,我的最终代码如下所示。

public static void main(String[] args) {
    String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
    String initials = Unidecode.decode( s2 );
    if( s1.equals(s2)) { //[ , ] %A0 - %2C - %20 « http://www.ascii-code.com/
        System.out.println("Equal Unicode Strings");
    } else if( s1.equals( initials ) ) {
        System.out.println("Equal Non Unicode Strings");
    } else {
        System.out.println("Not Equal");
    }

}

如果你没有库,使用regex和Normalizer的最好方法之一是:

    public String flattenToAscii(String s) {
                if(s == null || s.trim().length() == 0)
                        return "";
                return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");
}

这比replaceAll("[^\p{ASCII}]", ""))更有效,而且如果你不需要变音符符(就像你的例子一样)。

否则,您必须使用p{ASCII}模式。

的问候。

一种快速安全的方式

public static String removeDiacritics(String str) {
    if (str == null)
        return null;
    if (str.isEmpty())
        return "";
    
    int len = str.length();
    StringBuilder sb
        = new StringBuilder(len);
    
    //iterate string codepoints
    for (int i = 0; i < len; ) {
        int codePoint = str.codePointAt(i);
        int charCount
            = Character.charCount(codePoint);
        
        if (charCount > 1) {
            for (int j = 0; j < charCount; j++)
                sb.append(str.charAt(i + j));
            i += charCount;
            continue;
        }
        else if (codePoint <= 127) {
            sb.append((char)codePoint);
            i++;
            continue;
        }
        
        sb.append(
            java.text.Normalizer
                .normalize(
                    Character.toString((char)codePoint),
                    java.text.Normalizer.Form.NFD)
                        .charAt(0));
        i++;
    }
    
    return sb.toString();
}