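Is there a better way to get rid of accents and turn those letters into regular ones than using the String.replaceAll() method and replacing letters one by one? Example:
Input: orčpžsíáýd
Output: orcpzsiayd
It doesn't need to cover every alphabet that has accented letters, such as Russian or Chinese.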
Current answer
I ran into the same issue with string equality checks, where one of the compared strings contains a character with a code in the 128-255 range (outside 7-bit ASCII).
For example, a non-breaking space is [Hex - A0], while a regular space is [Hex - 20]. To show non-breaking spaces in HTML I used the following spacing entities. Their characters and UTF-8 bytes are: &emsp; is a very wide space {-30, -128, -125}, &ensp; is a somewhat wide space {-30, -128, -126}, &thinsp; is a narrow space {-30, -128, -119}, and a plain non-HTML space is {32}.
String s1 = "My Sample Space Data", s2 = "My\u2003Sample\u2003Space\u2003Data"; // s2 uses em spaces (U+2003)
System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes(java.nio.charset.StandardCharsets.UTF_8)));
System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes(java.nio.charset.StandardCharsets.UTF_8)));
Output in bytes:
S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97]
S2: [77, 121, -30, -128, -125, 83, 97, 109, 112, 108, 101, -30, -128, -125, 83, 112, 97, 99, 101, -30, -128, -125, 68, 97, 116, 97]
Use the code below to inspect different spaces and their byte codes (see the Wikipedia page List_of_Unicode_characters):
String spacing_entities = "very wide space,narrow space,regular space,invisible separator";
System.out.println("Space String :" + spacing_entities);
// UTF-8 bytes for: em space, comma, en space, comma, regular space, comma, non-breaking space
byte[] byteArray =
        // spacing_entities.getBytes( Charset.forName("UTF-8") );
        // Charset.forName("UTF-8").encode( s2 ).array();
        {-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
System.out.println("Bytes:" + Arrays.toString( byteArray ) );
try {
    System.out.format("Bytes to String[%S] \n ", new String(byteArray, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
➩ ASCII transliterations of Unicode strings for Java: unidecode.
String initials = Unidecode.decode( s2 );
➩ Using Guava: Google Core Libraries for Java.
String replaceFrom = CharMatcher.whitespace().replaceFrom( s2, " " );
To URL-encode the space, use the Guava library.
String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);
➩ To overcome this problem I used String.replaceAll() with some regular expressions.
// \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
s2 = s2.replaceAll("\\p{Zs}", " ");
s2 = s2.replaceAll("[^\\p{ASCII}]", " ");
s2 = s2.replaceAll("\u2003", " "); // em space to regular space
➩ Using java.text.Normalizer.Form. This enum provides constants of the four Unicode normalization forms that are described in Unicode Standard Annex #15 (Unicode Normalization Forms) and two methods to access them.
s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC);
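As a rough illustration of the two standard-library options in that list, the sketch below folds an em space, a non-breaking space and an en space back to U+0020, once with the \p{Zs} regex and once with NFKC normalization. The class and method names (SpaceNormalizer, viaRegex, viaNfkc) are made up for this example and are not part of any library.

```java
import java.text.Normalizer;

final class SpaceNormalizer {

    /** Replaces every Unicode space separator (general category Zs) with U+0020. */
    static String viaRegex(String s) {
        return s.replaceAll("\\p{Zs}", " ");
    }

    /** NFKC folds compatibility characters such as U+2003 and U+00A0 to U+0020. */
    static String viaNfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String plain = "My Sample Space Data";
        // em space (U+2003), non-breaking space (U+00A0), en space (U+2002)
        String fancy = "My\u2003Sample\u00A0Space\u2002Data";

        System.out.println(plain.equals(fancy));            // raw comparison
        System.out.println(plain.equals(viaRegex(fancy)));  // after the regex replacement
        System.out.println(plain.equals(viaNfkc(fancy)));   // after NFKC normalization
    }
}
```

Note that NFKC also rewrites other compatibility characters (ligatures, full-width forms and so on), so the regex variant is the narrower tool when only whitespace should change.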
Testing a string and its output with the different approaches: ➩ Unidecode, Normalizer, StringUtils.
String strUni = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß";
// Unidecode output: This is a funky String AE,O,D,ss
String initials = Unidecode.decode( strUni );
// The following approaches produce: This is a funky String Æ,Ø,Ð,ß
String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
temp = pattern.matcher(temp).replaceAll("");
String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );
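For readers who only need the Normalizer variant, here is a minimal self-contained sketch applied to the input from the question (orčpžsíáýd, written with Unicode escapes so the file compiles under any source encoding). The class name AccentStripper is hypothetical; only java.text.Normalizer and java.util.regex.Pattern come from the JDK.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class AccentStripper {
    // Compiled once: matches runs of combining marks in the U+0300..U+036F block.
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    /** Decomposes accented letters (NFD) and then drops the combining marks. */
    static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // The question's sample input, spelled with escapes.
        String input = "or\u010Dp\u017Es\u00ED\u00E1\u00FDd";
        System.out.println(strip(input)); // prints orcpzsiayd
    }
}
```

Letters that have no canonical decomposition, such as Æ, Ø, Ð and ß, pass through this method unchanged.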
Using Unidecode is the best choice; my final code is shown below.
public static void main(String[] args) {
    // s2 uses em spaces (U+2003) instead of regular spaces
    String s1 = "My Sample Space Data", s2 = "My\u2003Sample\u2003Space\u2003Data";
    String initials = Unidecode.decode( s2 );
    if( s1.equals(s2)) { //[ , ] %A0 - %2C - %20 « http://www.ascii-code.com/
        System.out.println("Equal Unicode Strings");
    } else if( s1.equals( initials ) ) {
        System.out.println("Equal Non Unicode Strings");
    } else {
        System.out.println("Not Equal");
    }
}
Other answers
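EDIT: If you are not stuck with Java <6, and speed is not critical and/or the translation table is too limiting, use David's answer. The point is to use Normalizer (introduced in Java 6) instead of a translation table inside a loop.
While this is not a "perfect" solution, it works well when you know the range (in our case Latin1,2), it worked before Java 6 (not a real issue though), and it is much faster than the most commonly suggested version (which may or may not matter):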
/**
 * Mirror of the unicode table from 00c0 to 017f without diacritics.
 */
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
    "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
    "aaaaaaaceeeeiiii" +
    "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +
    "UuUuWwYyYZzZzZzF";

/**
 * Returns string without diacritics - 7 bit approximation.
 *
 * @param source string to convert
 * @return corresponding string without diacritics
 */
public static String removeDiacritic(String source) {
    char[] vysl = new char[source.length()];
    char one;
    for (int i = 0; i < source.length(); i++) {
        one = source.charAt(i);
        if (one >= '\u00c0' && one <= '\u017f') {
            one = tab00c0.charAt((int) one - '\u00c0');
        }
        vysl[i] = one;
    }
    return new String(vysl);
}
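Tests on my hardware with a 32-bit JDK show that this performs the conversion from àèéľšťč89FDČ to aeelstc89FDC 1 million times in ~100ms, while the Normalizer approach takes 3.7s (37x slower). If your needs are performance-driven and you know the input range, this may be for you.
Enjoy :-)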
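To try the comparison yourself, a naive timing harness along the following lines may help. It is only a sketch: the class name DiacriticTiming and the method names are invented here, the table is copied from the answer, and a plain System.nanoTime() loop without warm-up gives rough numbers at best (a harness such as JMH would be more trustworthy). The sample string is àèéľšťč89FDČ written with Unicode escapes.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class DiacriticTiming {
    // Same mirror table as in the answer: U+00C0..U+017F without diacritics.
    private static final String TAB_00C0 = "AAAAAAACEEEEIIII"
            + "DNOOOOO\u00d7\u00d8UUUUYI\u00df"
            + "aaaaaaaceeeeiiii"
            + "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey"
            + "AaAaAaCcCcCcCcDd"
            + "DdEeEeEeEeEeGgGg"
            + "GgGgHhHhIiIiIiIi"
            + "IiJjJjKkkLlLlLlL"
            + "lLlNnNnNnnNnOoOo"
            + "OoOoRrRrRrSsSsSs"
            + "SsTtTtTtUuUuUuUu"
            + "UuUuWwYyYZzZzZzF";

    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    /** Table lookup: one array read per character in the covered range. */
    static String viaTable(String source) {
        char[] out = new char[source.length()];
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            out[i] = (c >= '\u00c0' && c <= '\u017f') ? TAB_00C0.charAt(c - '\u00c0') : c;
        }
        return new String(out);
    }

    /** NFD decomposition followed by removal of combining marks. */
    static String viaNormalizer(String source) {
        String decomposed = Normalizer.normalize(source, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "\u00E0\u00E8\u00E9\u013E\u0161\u0165\u010D89FD\u010C";
        int rounds = 1_000_000;
        int sink = 0; // consume results so the loops are not trivially dead code

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) sink += viaTable(sample).length();
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) sink += viaNormalizer(sample).length();
        long t2 = System.nanoTime();

        System.out.printf("table: %d ms, normalizer: %d ms (sink=%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sink);
    }
}
```

The absolute numbers will differ from the ones quoted above depending on the JVM and hardware; the point is only to compare the two methods on the same machine.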
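This solution is already available as StringUtils.stripAccents() in the Maven repository (Apache Commons Lang) and works for Ł, as @DavidS mentioned. But I needed it to work for both Ø and Ł, so I modified it as below. It may help others too.
Update
Here is a modified version of StringUtils.stripAccents(String obj) that keeps the old functionality and also handles the Ø and Ł characters.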
public static String stripAccents(final String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    for (int i = 0; i < decomposed.length(); i++) {
        if (decomposed.charAt(i) == '\u0141') {
            decomposed.setCharAt(i, 'L');
        } else if (decomposed.charAt(i) == '\u0142') {
            decomposed.setCharAt(i, 'l');
        } else if (decomposed.charAt(i) == '\u00D8') {
            decomposed.setCharAt(i, 'O');
        } else if (decomposed.charAt(i) == '\u00F8') {
            decomposed.setCharAt(i, 'o');
        }
    }
    // Note that this doesn't correctly remove ligatures...
    return Pattern.compile("\\p{InCombiningDiacriticalMarks}+").matcher(decomposed).replaceAll("");
}
Input string: Ł Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Ø ø
Output string: L This is a funky String O o
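The if/else chain above can also be expressed as a small lookup table, which makes it easier to add further letters that NFD does not decompose (and to map one character to several, such as ß to ss). The following is only a sketch under that assumption: the class AsciiFolder, the method fold and the FALLBACK table are hypothetical names, the table entries are examples rather than a complete list, and Map.of requires Java 9 or later.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.regex.Pattern;

final class AsciiFolder {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Hypothetical fallback table for letters that NFD leaves intact; extend as needed.
    private static final Map<Character, String> FALLBACK = Map.of(
            '\u0141', "L",  '\u0142', "l",   // L with stroke
            '\u00D8', "O",  '\u00F8', "o",   // O with stroke
            '\u00C6', "AE", '\u00E6', "ae",  // AE ligature
            '\u00D0', "D",  '\u00F0', "d",   // eth
            '\u00DF', "ss");                 // sharp s

    /** NFD-decomposes, applies the fallback table, then strips combining marks. */
    static String fold(String input) {
        if (input == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            String mapped = FALLBACK.get(c);
            if (mapped != null) {
                sb.append(mapped);
            } else {
                sb.append(c);
            }
        }
        return MARKS.matcher(sb).replaceAll("");
    }

    public static void main(String[] args) {
        // "Lodz Ore strasse" spelled with its accented and stroked letters
        System.out.println(fold("\u0141\u00F3d\u017A \u00D8re stra\u00DFe"));
    }
}
```

Unlike setCharAt, appending to a fresh StringBuilder allows replacements that are longer than one character.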
Facing the same problem, here is a solution using a Kotlin extension property
import java.text.Normalizer

val String.stripAccents: String
    get() = Regex("\\p{InCombiningDiacriticalMarks}+")
        .replace(
            Normalizer.normalize(this, Normalizer.Form.NFD),
            ""
        )
Usage
val textWithoutAccents = "some accented string".stripAccents
In case anyone is struggling to do this in Kotlin, this code works like a charm. To avoid inconsistencies I also use .toUpperCase() and trim(). Then I call this function:
fun stripAccents(s: String?): String {
    if (s == null) {
        return ""
    }
    val chars: CharArray = s.toCharArray()
    val sb = StringBuilder(s)
    var cont = 0
    while (chars.size > cont) {
        val c: Char = chars[cont]
        var c2: String = c.toString()
        // these are my needs; to convert other accents, just add new entries here
        c2 = c2.replace("Ã", "A")
        c2 = c2.replace("Õ", "O")
        c2 = c2.replace("Ç", "C")
        c2 = c2.replace("Á", "A")
        c2 = c2.replace("Ó", "O")
        c2 = c2.replace("Ê", "E")
        c2 = c2.replace("É", "E")
        c2 = c2.replace("Ú", "U")
        sb.setCharAt(cont, c2.single())
        cont++
    }
    return sb.toString()
}
To use this function, call it like this:
var str: String
str = editText.text.toString() //get the text from EditText
str = str.toUpperCase().trim()
str = stripAccents(str) //call the function