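Is there a better way to get rid of accents and turn those letters into regular ones, apart from using the String.replaceAll() method and replacing the letters one by one? Example:
Input: orčpžsíáýd
Output: orcpzsiayd
It doesn't need to cover every alphabet with accented letters, such as the Russian or the Chinese one.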
Current answer
A fast and safe way:
public static String removeDiacritics(String str) {
    if (str == null)
        return null;
    if (str.isEmpty())
        return "";
    int len = str.length();
    StringBuilder sb = new StringBuilder(len);
    // iterate over the string's code points
    for (int i = 0; i < len; ) {
        int codePoint = str.codePointAt(i);
        int charCount = Character.charCount(codePoint);
        if (charCount > 1) {
            // supplementary code point (surrogate pair): copy unchanged
            for (int j = 0; j < charCount; j++)
                sb.append(str.charAt(i + j));
            i += charCount;
            continue;
        }
        else if (codePoint <= 127) {
            // plain ASCII: copy unchanged
            sb.append((char)codePoint);
            i++;
            continue;
        }
        // any other BMP character: keep only the first char of its NFD form
        sb.append(
            java.text.Normalizer
                .normalize(
                    Character.toString((char)codePoint),
                    java.text.Normalizer.Form.NFD)
                .charAt(0));
        i++;
    }
    return sb.toString();
}
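As a quick check of the per-code-point idea above, here is a compact, self-contained sketch of the same approach run against the example from the question. The class name DiacriticsDemo is made up for illustration; the logic is intended to mirror the method above, not to replace it.

```java
import java.text.Normalizer;

class DiacriticsDemo {
    // Keep ASCII and supplementary code points untouched; replace any other
    // BMP character with the first char of its NFD decomposition.
    static String removeDiacritics(String str) {
        if (str == null) return null;
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            int n = Character.charCount(cp);
            if (n > 1 || cp <= 127) {
                sb.appendCodePoint(cp);
            } else {
                String nfd = Normalizer.normalize(
                        String.valueOf((char) cp), Normalizer.Form.NFD);
                sb.append(nfd.charAt(0));
            }
            i += n;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDiacritics("orčpžsíáýd")); // orcpzsiayd
        // Ø has no canonical decomposition, so it passes through unchanged.
        System.out.println(removeDiacritics("Ø"));          // Ø
    }
}
```

Note that this only handles precomposed letters: if the input is already decomposed (a base letter followed by a separate combining mark), the combining mark is kept.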
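Other answers
I ran into the same problem with string equality checks, where one of the strings being compared contains an ASCII character code in the range 128-255.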
I.e., non-breaking space [hex A0] versus space [hex 20]. To show a non-breaking space in HTML, I used the following spacing entities. Their characters and bytes are: &emsp is a very wide space [ ] {-30, -128, -125}, &ensp is a somewhat wide space [ ] {-30, -128, -126}, &thinsp is a narrow space [ ] {32}, non-HTML space {}.
String s1 = "My Sample Space Data", s2 = "My\u2003Sample\u2003Space\u2003Data";
System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes()));
System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes()));
Output in bytes:
S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97]
S2: [77, 121, -30, -128, -125, 83, 97, 109, 112, 108, 101, -30, -128, -125, 83, 112, 97, 99, 101, -30, -128, -125, 68, 97, 116, 97]
Use the code below for the different spaces and their byte codes (see the wiki page List_of_Unicode_characters):
String spacing_entities = "very wide space,narrow space,regular space,invisible separator";
System.out.println("Space String :"+ spacing_entities);
byte[] byteArray =
// spacing_entities.getBytes( Charset.forName("UTF-8") );
// Charset.forName("UTF-8").encode( s2 ).array();
{-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
System.out.println("Bytes:"+ Arrays.toString( byteArray ) );
try {
System.out.format("Bytes to String[%S] \n ", new String(byteArray, "UTF-8"));
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
➩ ASCII transliterations of Unicode strings for Java: unidecode.
String initials = Unidecode.decode( s2 );
➩ Using Guava: Google Core Libraries for Java.
String replaceFrom = CharMatcher.whitespace().replaceFrom( s2, " " );
To URL-encode the space, use the Guava library.
String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);
➩ To overcome this problem, I used String.replaceAll() with some regular expressions.
// \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
s2 = s2.replaceAll("\\p{Zs}", " ");
s2 = s2.replaceAll("[^\\p{ASCII}]", " ");
s2 = s2.replaceAll(" ", " ");
➩ Using java.text.Normalizer.Form. This enum provides constants for the four Unicode normalization forms described in Unicode Standard Annex #15 (Unicode Normalization Forms), and two methods to access them.
s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC);
Test string and outputs of the different approaches, such as ➩ Unidecode, Normalizer, StringUtils:
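To make the regular-expression and NFKC options above concrete, here is a minimal JDK-only sketch. The class and method names (SpaceNormalizeDemo, viaRegex, viaNfkc) are made up for illustration, and the test string mixing an em space (U+2003) with a non-breaking space (U+00A0) is an assumption, not the original author's data.

```java
import java.text.Normalizer;

class SpaceNormalizeDemo {
    // Replace every Unicode space separator (general category Zs) with a plain space.
    static String viaRegex(String s) {
        return s.replaceAll("\\p{Zs}", " ");
    }

    // NFKC applies compatibility mappings, which turn the non-breaking space
    // and the em space into an ordinary space.
    static String viaNfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String s1 = "My Sample Space Data";
        String s2 = "My\u2003Sample\u00A0Space\u2003Data";
        System.out.println(s1.equals(s2));           // false
        System.out.println(s1.equals(viaRegex(s2))); // true
        System.out.println(s1.equals(viaNfkc(s2)));  // true
    }
}
```

The regex variant only touches space separators, while NFKC also rewrites other compatibility characters (ligatures, full-width forms and so on), so which one fits depends on how much of the rest of the string should be left alone.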
String strUni = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß";
// This is a funky String AE,O,D,ss
String initials = Unidecode.decode( strUni );
// The following produces: This is a funky String Æ,Ø,Ð,ß
// (Æ, Ø, Ð and ß have no canonical decomposition, so they are left as they are)
String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
temp = pattern.matcher(temp).replaceAll("");
String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );
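The Normalizer plus regex approach from the snippet above can be run on its own with nothing but the JDK. The following sketch wraps it in a hypothetical helper named stripMarks (class name StripMarksDemo is also made up) and precompiles the pattern so it is not rebuilt on every call.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class StripMarksDemo {
    // Matches runs of characters from the Combining Diacritical Marks block.
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Decompose to NFD, then delete the combining marks that were split off.
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // Letters without a canonical decomposition (Æ, Ø, Ð, ß) are kept.
        System.out.println(stripMarks("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß"));
        // This is a funky String Æ,Ø,Ð,ß
    }
}
```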
Using Unidecode is the best choice; my final code is shown below.
public static void main(String[] args) {
String s1 = "My Sample Space Data", s2 = "My\u2003Sample\u2003Space\u2003Data";
String initials = Unidecode.decode( s2 );
if( s1.equals(s2)) { //[ , ] %A0 - %2C - %20 « http://www.ascii-code.com/
System.out.println("Equal Unicode Strings");
} else if( s1.equals( initials ) ) {
System.out.println("Equal Non Unicode Strings");
} else {
System.out.println("Not Equal");
}
}
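This solution is already available as StringUtils.stripAccents() in the Maven repository, and it works for Ł as @DavidS mentioned. But I needed it to work for both Ø and Ł, so I modified it as below. It may be helpful for others too.
Update
Here is a modified version of StringUtils.stripAccents(String obj) that keeps the old functionality and also handles the Ø and Ł characters.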
public static String stripAccents(final String input) {
if (input == null) {
return null;
}
final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
for (int i = 0; i < decomposed.length(); i++) {
if (decomposed.charAt(i) == '\u0141') {
decomposed.setCharAt(i, 'L');
} else if (decomposed.charAt(i) == '\u0142') {
decomposed.setCharAt(i, 'l');
}else if (decomposed.charAt(i) == '\u00D8') {
decomposed.setCharAt(i, 'O');
}else if (decomposed.charAt(i) == '\u00F8') {
decomposed.setCharAt(i, 'o');
}
}
// Note that this doesn't correctly remove ligatures...
return Pattern.compile("\\p{InCombiningDiacriticalMarks}+").matcher(decomposed).replaceAll("");
}
Input string: Ł Tĥïŝ ... Output string: This is a funky String O O
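If more letters without a canonical decomposition need special handling, the if/else chain above could be swapped for a lookup table. The sketch below is one possible way to do that, assuming Java 9+ for Map.of; the class name StripAccentsTable and the EXTRA map are made up for illustration and cover only the four characters the answer handles.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.regex.Pattern;

class StripAccentsTable {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Letters that NFD leaves untouched, mapped by hand (extend as needed).
    private static final Map<Character, Character> EXTRA = Map.of(
            '\u0141', 'L',   // Ł
            '\u0142', 'l',   // ł
            '\u00D8', 'O',   // Ø
            '\u00F8', 'o');  // ø

    static String stripAccents(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder sb =
                new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
        for (int i = 0; i < sb.length(); i++) {
            Character replacement = EXTRA.get(sb.charAt(i));
            if (replacement != null) {
                sb.setCharAt(i, replacement);
            }
        }
        return MARKS.matcher(sb).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("Łódź Ø ø")); // Lodz O o
    }
}
```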
I faced the same problem; here is a solution using a Kotlin extension:
import java.text.Normalizer

val String.stripAccents: String
    get() = Regex("\\p{InCombiningDiacriticalMarks}+")
        .replace(
            Normalizer.normalize(this, Normalizer.Form.NFD),
            ""
        )
Usage:
val textWithoutAccents = "some accented string".stripAccents
Depending on the language, these may not be considered accents (which change the sound of the letter) but diacritical marks.
https://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics
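"Bosnian and Croatian have the symbols č, ć, đ, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."
Removing them may inherently change the meaning of the word, or turn the letters into completely different ones.
The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered how much of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex: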
import java.text.Normalizer;
public class Strip {
public static String flattenToAscii(String string) {
StringBuilder sb = new StringBuilder(string.length());
string = Normalizer.normalize(string, Normalizer.Form.NFD);
for (char c : string.toCharArray()) {
if (c <= '\u007F') sb.append(c);
}
return sb.toString();
}
}
A small additional speed-up can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure the loss of code clarity is worth it:
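public static String flattenToAscii(String string) {
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    // size the buffer from the normalized string, which may be longer than the input
    char[] out = new char[string.length()];
    int j = 0;
    for (int i = 0, n = string.length(); i < n; ++i) {
        char c = string.charAt(i);
        if (c <= '\u007F') out[j++] = c;
    }
    // copy only the j chars actually written, not the unused tail of the buffer
    return new String(out, 0, j);
}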
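This variant combines the correctness of the Normalizer-based approach with some of the speed of the table-based one. On my machine, it is about 4x faster than the accepted answer and 6.6x to 7x slower than @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).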
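One behavior of the ASCII-flattening approach is worth seeing in a runnable form: any character that NFD cannot decompose into an ASCII base letter is dropped entirely instead of being transliterated. The sketch below is a self-contained version of the char[] variant; the class name FlattenDemo and the sample word are chosen for illustration.

```java
import java.text.Normalizer;

class FlattenDemo {
    // Decompose to NFD, then keep only chars in the 7-bit ASCII range.
    static String flattenToAscii(String s) {
        String norm = Normalizer.normalize(s, Normalizer.Form.NFD);
        char[] out = new char[norm.length()];
        int j = 0;
        for (int i = 0, n = norm.length(); i < n; ++i) {
            char c = norm.charAt(i);
            if (c <= '\u007F') out[j++] = c;
        }
        return new String(out, 0, j);
    }

    public static void main(String[] args) {
        System.out.println(flattenToAscii("orčpžsíáýd")); // orcpzsiayd
        // Ł has no canonical decomposition, so it is removed rather than mapped to L.
        System.out.println(flattenToAscii("Łódź"));       // odz
    }
}
```

If dropped letters are a problem, the hand-written mappings for Ł and Ø shown in the earlier answer can be applied before the ASCII filter.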