除了使用String.replaceAll()方法并逐个替换字母之外,还有更好的方法来摆脱重音并使这些字母规则吗? 例子:

输入:或者čpžsíáýd

输出:orcpzsiayd

它不需要包括所有有口音的字母,比如俄语字母或汉语字母。


当前回答

@virgo47的解决方案非常快,但很接近。接受的答案使用Normalizer和正则表达式。我想知道Normalizer和正则表达式占用了多少时间,因为删除所有非ascii字符可以在没有正则表达式的情况下完成:

import java.text.Normalizer;

public class Strip {
    public static String flattenToAscii(String string) {
        StringBuilder sb = new StringBuilder(string.length());
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        for (char c : string.toCharArray()) {
            if (c <= '\u007F') sb.append(c);
        }
        return sb.toString();
    }
}

小的额外加速可以通过写入char[]而不调用toCharArray()来获得,尽管我不确定代码清晰度的降低是否值得这样做:

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    int j = 0;
    for (int i = 0, n = string.length(); i < n; ++i) {
        char c = string.charAt(i);
        if (c <= '\u007F') out[j++] = c;
    }
    return new String(out);
}

这种变化具有使用Normalizer的正确性和使用表的一些速度方面的优点。在我的机器上,这个答案比公认的答案快4倍,比@virgo47的答案慢6.6倍到7倍(公认的答案比我机器上的@virgo47的答案慢26倍)。

其他回答

一种快速安全的方式

public static String removeDiacritics(String str) {
    if (str == null)
        return null;
    if (str.isEmpty())
        return "";
    
    int len = str.length();
    StringBuilder sb
        = new StringBuilder(len);
    
    //iterate string codepoints
    for (int i = 0; i < len; ) {
        int codePoint = str.codePointAt(i);
        int charCount
            = Character.charCount(codePoint);
        
        if (charCount > 1) {
            for (int j = 0; j < charCount; j++)
                sb.append(str.charAt(i + j));
            i += charCount;
            continue;
        }
        else if (codePoint <= 127) {
            sb.append((char)codePoint);
            i++;
            continue;
        }
        
        sb.append(
            java.text.Normalizer
                .normalize(
                    Character.toString((char)codePoint),
                    java.text.Normalizer.Form.NFD)
                        .charAt(0));
        i++;
    }
    
    return sb.toString();
}

我也遇到过与字符串相等性检查相关的相同问题,比较字符串中的一个 ASCII字符码128-255。

i.e., Non-breaking space - [Hex - A0] Space [Hex - 20]. To show Non-breaking space over HTML. I have used the following spacing entities. Their character and its bytes are like &emsp is very wide space[ ]{-30, -128, -125}, &ensp is somewhat wide space[ ]{-30, -128, -126}, &thinsp is narrow space[ ]{32} , Non HTML Space {} String s1 = "My Sample Space Data", s2 = "My Sample Space Data"; System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes())); System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes())); Output in Bytes: S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97] S2: [77, 121, -30, -128, -125, 83, 97, 109, 112, 108, 101, -30, -128, -125, 83, 112, 97, 99, 101, -30, -128, -125, 68, 97, 116, 97]

对于不同的空格及其字节码使用下面的代码:wiki for List_of_Unicode_characters

String spacing_entities = "very wide space,narrow space,regular space,invisible separator";
System.out.println("Space String :"+ spacing_entities);
byte[] byteArray = 
    // spacing_entities.getBytes( Charset.forName("UTF-8") );
    // Charset.forName("UTF-8").encode( s2 ).array();
    {-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
System.out.println("Bytes:"+ Arrays.toString( byteArray ) );
try {
    System.out.format("Bytes to String[%S] \n ", new String(byteArray, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

➩ ASCII transliterations of Unicode string for Java. unidecode String initials = Unidecode.decode( s2 ); ➩ using Guava: Google Core Libraries for Java. String replaceFrom = CharMatcher.WHITESPACE.replaceFrom( s2, " " ); For URL encode for the space use Guava laibrary. String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString); ➩ To overcome this problem used String.replaceAll() with some RegularExpression. // \p{Z} or \p{Separator}: any kind of whitespace or invisible separator. s2 = s2.replaceAll("\\p{Zs}", " "); s2 = s2.replaceAll("[^\\p{ASCII}]", " "); s2 = s2.replaceAll(" ", " "); ➩ Using java.text.Normalizer.Form. This enum provides constants of the four Unicode normalization forms that are described in Unicode Standard Annex #15 — Unicode Normalization Forms and two methods to access them. s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC);


测试字符串和输出的不同方法,如➩Unidecode, Normalizer, StringUtils。

String strUni = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß";

// This is a funky String AE,O,D,ss
String initials = Unidecode.decode( strUni );

// Following Produce this o/p: Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß
String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
temp = pattern.matcher(temp).replaceAll("");

String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );

使用unidcode是最好的选择,我的最终代码如下所示。

public static void main(String[] args) {
    String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
    String initials = Unidecode.decode( s2 );
    if( s1.equals(s2)) { //[ , ] %A0 - %2C - %20 « http://www.ascii-code.com/
        System.out.println("Equal Unicode Strings");
    } else if( s1.equals( initials ) ) {
        System.out.println("Equal Non Unicode Strings");
    } else {
        System.out.println("Not Equal");
    }

}

如果有人在kotlin中很难做到这一点,这段代码就像一个魅力。为了避免不一致,我也使用. touppercase和Trim()。然后我强制转换这个函数:

   fun stripAccents(s: String):String{

   if (s == null) {
      return "";
   }

val chars: CharArray = s.toCharArray()

var sb = StringBuilder(s)
var cont: Int = 0

while (chars.size > cont) {
    var c: kotlin.Char
    c = chars[cont]
    var c2:String = c.toString()
   //these are my needs, in case you need to convert other accents just Add new entries aqui
    c2 = c2.replace("Ã", "A")
    c2 = c2.replace("Õ", "O")
    c2 = c2.replace("Ç", "C")
    c2 = c2.replace("Á", "A")
    c2 = c2.replace("Ó", "O")
    c2 = c2.replace("Ê", "E")
    c2 = c2.replace("É", "E")
    c2 = c2.replace("Ú", "U")

    c = c2.single()
    sb.setCharAt(cont, c)
    cont++

}

return sb.toString()

}

要像这样使用这些有趣的转换代码:

     var str: String
     str = editText.text.toString() //get the text from EditText
     str = str.toUpperCase().trim()

     str = stripAccents(str) //call the function

编辑:如果你不困于Java <6,速度不是关键,/或翻译表太有限,请使用David的回答。重点是使用Normalizer(在Java 6中引入),而不是在循环中使用转换表。

虽然这不是“完美”的解决方案,但当你知道范围(在我们的例子中是latin1,2)时,它工作得很好,在Java 6之前工作(虽然不是一个真正的问题),并且比大多数建议的版本快得多(可能是也可能不是一个问题):

    /**
 * Mirror of the unicode table from 00c0 to 017f without diacritics.
 */
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
    "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
    "aaaaaaaceeeeiiii" +
    "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +
    "UuUuWwYyYZzZzZzF";

/**
 * Returns string without diacritics - 7 bit approximation.
 *
 * @param source string to convert
 * @return corresponding string without diacritics
 */
public static String removeDiacritic(String source) {
    char[] vysl = new char[source.length()];
    char one;
    for (int i = 0; i < source.length(); i++) {
        one = source.charAt(i);
        if (one >= '\u00c0' && one <= '\u017f') {
            one = tab00c0.charAt((int) one - '\u00c0');
        }
        vysl[i] = one;
    }
    return new String(vysl);
}

在我使用32位JDK的HW上进行的测试表明,这在~100ms内执行了从àèéľšťč89FDČ到aeelstc89FDC的100万次转换,而Normalizer方式使其在3.7s(慢37倍)。如果您的需求与性能有关,并且您知道输入范围,那么这可能适合您。

喜欢:-)

我认为最好的解决方案是将每个char转换为HEX,并用另一个HEX替换它。因为有两种Unicode类型:

Composite Unicode
Precomposed Unicode

例如,Composite Unicode编写的“Ồ”不同于precompose Unicode编写的“Ồ”。您可以复制我的示例字符并转换它们以查看差异。

In Composite Unicode, "Ồ" is combined from 2 char: Ô (U+00d4) and ̀ (U+0300)
In Precomposed Unicode, "Ồ" is single char (U+1ED2)

我为一些银行开发了这个功能,以便在将信息发送到核心银行(通常不支持Unicode)之前转换信息,当最终用户使用多种Unicode类型输入数据时,就会遇到这个问题。所以我认为,转换为HEX并替换它是最可靠的方法。