我试图转换一些字符串,在法国加拿大,基本上,我想能够拿出法国重音标记在字母,同时保持字母。(例如,将é转换为e,那么crème brûlée就会变成creme brulee)
实现这一目标的最佳方法是什么?
我试图转换一些字符串,在法国加拿大,基本上,我想能够拿出法国重音标记在字母,同时保持字母。(例如,将é转换为e,那么crème brûlée就会变成creme brulee)
实现这一目标的最佳方法是什么?
当前回答
你可以从MMLib中使用字符串扩展。扩展nuget包:
using MMLib.RapidPrototyping.Generators;
public void ExtensionsExample()
{
string target = "aácčeéií";
Assert.AreEqual("aacceeii", target.RemoveDiacritics());
}
Nuget页面:https://www.nuget.org/packages/MMLib.Extensions/ Codeplex项目网站https://mmlib.codeplex.com/
其他回答
这招对我很管用……
string accentedStr;
byte[] tempBytes;
tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);
快速短!
我没有使用过这种方法,但是Michael Kaplan在他的博客文章(有一个令人困惑的标题)中描述了一种方法,谈论剥离变音符:剥离是一项有趣的工作(又名剥离) 论无意义的意义,即一切 Mn字符是非空格的,但是 有些更非间距比 其他人)
static string RemoveDiacritics(string text)
{
var normalizedString = text.Normalize(NormalizationForm.FormD);
var stringBuilder = new StringBuilder(capacity: normalizedString.Length);
for (int i = 0; i < normalizedString.Length; i++)
{
char c = normalizedString[i];
var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
if (unicodeCategory != UnicodeCategory.NonSpacingMark)
{
stringBuilder.Append(c);
}
}
return stringBuilder
.ToString()
.Normalize(NormalizationForm.FormC);
}
请注意,这是他之前帖子的后续:剥离变音符....
该方法使用String。Normalize将输入字符串分割为组成符号(基本上是将“基本”字符与变音符符分开),然后扫描结果并仅保留基本字符。这只是有点复杂,但实际上你看到的是一个复杂的问题。
当然,如果你限制自己使用法语,你可能会使用@David Dibben推荐的如何在c++ std::string中删除重音和波浪号的简单基于表的方法。
没有足够的声誉,显然我不能评论亚历山大的优秀链接。Lucene似乎是唯一的解决方案在合理的通用情况下工作。
对于那些想要一个简单的复制粘贴解决方案的人,这里是利用Lucene中的代码:
字符串试验台= " AAAACEIIOOØUUÞaaaaaaæceeeeiiiið人参公鸡øUUāăčĐęğıŁłńŌōřŞşšźžșțệủ”;
Console.WriteLine (Lucene.latinizeLucene(实验);
AAAACEIIOOOUUTHaaaaaaaeceeeeiiiidnoooouuaacDegiLlnOorSsszzsteu
//////////
public static class Lucene
{
// source: https://raw.githubusercontent.com/apache/lucenenet/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs
// idea: https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net (scroll down, search for lucene by Alexander)
public static string latinizeLucene(string arg)
{
char[] argChar = arg.ToCharArray();
// latinizeLuceneImpl can expand one char up to four chars - e.g. Þ to TH, or æ to ae, or in fact ⑽ to (10)
char[] resultChar = new String(' ', arg.Length * 4).ToCharArray();
int outputPos = Lucene.latinizeLuceneImpl(argChar, 0, ref resultChar, 0, arg.Length);
string ret = new string(resultChar);
ret = ret.Substring(0, outputPos);
return ret;
}
/// <summary>
/// Converts characters above ASCII to their ASCII equivalents. For example,
/// accents are removed from accented characters.
/// <para/>
/// @lucene.internal
/// </summary>
/// <param name="input"> The characters to fold </param>
/// <param name="inputPos"> Index of the first character to fold </param>
/// <param name="output"> The result of the folding. Should be of size >= <c>length * 4</c>. </param>
/// <param name="outputPos"> Index of output where to put the result of the folding </param>
/// <param name="length"> The number of characters to fold </param>
/// <returns> length of output </returns>
private static int latinizeLuceneImpl(char[] input, int inputPos, ref char[] output, int outputPos, int length)
{
int end = inputPos + length;
for (int pos = inputPos; pos < end; ++pos)
{
char c = input[pos];
// Quick test: if it's not in range then just keep current character
if (c < '\u0080')
{
output[outputPos++] = c;
}
else
{
switch (c)
{
case '\u00C0': // À [LATIN CAPITAL LETTER A WITH GRAVE]
case '\u00C1': // Á [LATIN CAPITAL LETTER A WITH ACUTE]
case '\u00C2': // Â [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
case '\u00C3': // Ã [LATIN CAPITAL LETTER A WITH TILDE]
case '\u00C4': // Ä [LATIN CAPITAL LETTER A WITH DIAERESIS]
case '\u00C5': // Å [LATIN CAPITAL LETTER A WITH RING ABOVE]
case '\u0100': // Ā [LATIN CAPITAL LETTER A WITH MACRON]
case '\u0102': // Ă [LATIN CAPITAL LETTER A WITH BREVE]
case '\u0104': // Ą [LATIN CAPITAL LETTER A WITH OGONEK]
case '\u018F': // Ə http://en.wikipedia.org/wiki/Schwa [LATIN CAPITAL LETTER SCHWA]
case '\u01CD': // Ǎ [LATIN CAPITAL LETTER A WITH CARON]
case '\u01DE': // Ǟ [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
case '\u01E0': // Ǡ [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
case '\u01FA': // Ǻ [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
case '\u0200': // Ȁ [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
case '\u0202': // Ȃ [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
case '\u0226': // Ȧ [LATIN CAPITAL LETTER A WITH DOT ABOVE]
case '\u023A': // Ⱥ [LATIN CAPITAL LETTER A WITH STROKE]
case '\u1D00': // ᴀ [LATIN LETTER SMALL CAPITAL A]
case '\u1E00': // Ḁ [LATIN CAPITAL LETTER A WITH RING BELOW]
case '\u1EA0': // Ạ [LATIN CAPITAL LETTER A WITH DOT BELOW]
case '\u1EA2': // Ả [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
case '\u1EA4': // Ấ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
case '\u1EA6': // Ầ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
case '\u1EA8': // Ẩ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
case '\u1EAA': // Ẫ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
case '\u1EAC': // Ậ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
case '\u1EAE': // Ắ [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
case '\u1EB0': // Ằ [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
case '\u1EB2': // Ẳ [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
case '\u1EB4': // Ẵ [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
case '\u1EB6': // Ặ [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
case '\u24B6': // Ⓐ [CIRCLED LATIN CAPITAL LETTER A]
case '\uFF21': // A [FULLWIDTH LATIN CAPITAL LETTER A]
output[outputPos++] = 'A';
break;
case '\u00E0': // à [LATIN SMALL LETTER A WITH GRAVE]
case '\u00E1': // á [LATIN SMALL LETTER A WITH ACUTE]
case '\u00E2': // â [LATIN SMALL LETTER A WITH CIRCUMFLEX]
case '\u00E3': // ã [LATIN SMALL LETTER A WITH TILDE]
case '\u00E4': // ä [LATIN SMALL LETTER A WITH DIAERESIS]
case '\u00E5': // å [LATIN SMALL LETTER A WITH RING ABOVE]
case '\u0101': // ā [LATIN SMALL LETTER A WITH MACRON]
case '\u0103': // ă [LATIN SMALL LETTER A WITH BREVE]
case '\u0105': // ą [LATIN SMALL LETTER A WITH OGONEK]
case '\u01CE': // ǎ [LATIN SMALL LETTER A WITH CARON]
case '\u01DF': // ǟ [LATIN SMALL LETTER A WITH DIAERESIS AND MACRON]
case '\u01E1': // ǡ [LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON]
case '\u01FB': // ǻ [LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE]
case '\u0201': // ȁ [LATIN SMALL LETTER A WITH DOUBLE GRAVE]
case '\u0203': // ȃ [LATIN SMALL LETTER A WITH INVERTED BREVE]
case '\u0227': // ȧ [LATIN SMALL LETTER A WITH DOT ABOVE]
case '\u0250': // ɐ [LATIN SMALL LETTER TURNED A]
case '\u0259': // ə [LATIN SMALL LETTER SCHWA]
case '\u025A': // ɚ [LATIN SMALL LETTER SCHWA WITH HOOK]
case '\u1D8F': // ᶏ [LATIN SMALL LETTER A WITH RETROFLEX HOOK]
case '\u1D95': // ᶕ [LATIN SMALL LETTER SCHWA WITH RETROFLEX HOOK]
case '\u1E01': // ạ [LATIN SMALL LETTER A WITH RING BELOW]
case '\u1E9A': // ả [LATIN SMALL LETTER A WITH RIGHT HALF RING]
case '\u1EA1': // ạ [LATIN SMALL LETTER A WITH DOT BELOW]
case '\u1EA3': // ả [LATIN SMALL LETTER A WITH HOOK ABOVE]
case '\u1EA5': // ấ [LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE]
case '\u1EA7': // ầ [LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE]
case '\u1EA9': // ẩ [LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
case '\u1EAB': // ẫ [LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE]
case '\u1EAD': // ậ [LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
case '\u1EAF': // ắ [LATIN SMALL LETTER A WITH BREVE AND ACUTE]
case '\u1EB1': // ằ [LATIN SMALL LETTER A WITH BREVE AND GRAVE]
case '\u1EB3': // ẳ [LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE]
case '\u1EB5': // ẵ [LATIN SMALL LETTER A WITH BREVE AND TILDE]
case '\u1EB7': // ặ [LATIN SMALL LETTER A WITH BREVE AND DOT BELOW]
case '\u2090': // ₐ [LATIN SUBSCRIPT SMALL LETTER A]
case '\u2094': // ₔ [LATIN SUBSCRIPT SMALL LETTER SCHWA]
case '\u24D0': // ⓐ [CIRCLED LATIN SMALL LETTER A]
case '\u2C65': // ⱥ [LATIN SMALL LETTER A WITH STROKE]
case '\u2C6F': // Ɐ [LATIN CAPITAL LETTER TURNED A]
case '\uFF41': // a [FULLWIDTH LATIN SMALL LETTER A]
output[outputPos++] = 'a';
break;
case '\uA732': // Ꜳ [LATIN CAPITAL LETTER AA]
output[outputPos++] = 'A';
output[outputPos++] = 'A';
break;
case '\u00C6': // Æ [LATIN CAPITAL LETTER AE]
case '\u01E2': // Ǣ [LATIN CAPITAL LETTER AE WITH MACRON]
case '\u01FC': // Ǽ [LATIN CAPITAL LETTER AE WITH ACUTE]
case '\u1D01': // ᴁ [LATIN LETTER SMALL CAPITAL AE]
output[outputPos++] = 'A';
output[outputPos++] = 'E';
break;
case '\uA734': // Ꜵ [LATIN CAPITAL LETTER AO]
output[outputPos++] = 'A';
output[outputPos++] = 'O';
break;
case '\uA736': // Ꜷ [LATIN CAPITAL LETTER AU]
output[outputPos++] = 'A';
output[outputPos++] = 'U';
break;
// etc. etc. etc.
// see link above for complete source code
//
// unfortunately, postings are limited, as in
// "Body is limited to 30000 characters; you entered 136098."
[...]
case '\u2053': // ⁓ [SWUNG DASH]
case '\uFF5E': // ~ [FULLWIDTH TILDE]
output[outputPos++] = '~';
break;
default:
output[outputPos++] = c;
break;
}
}
}
return outputPos;
}
}
我经常使用基于我在这里找到的另一个版本的扩展方法 (参见在c# (ascii)中替换字符) 简单解释一下:
归一化形成D,将è等字符分割为e和非空格' 由此,nospacing字符被移除 结果归一化回形式C(我不确定这是否有必要)
代码:
using System.Linq;
using System.Text;
using System.Globalization;
// namespace here
public static class Utility
{
public static string RemoveDiacritics(this string str)
{
if (null == str) return null;
var chars =
from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
let uc = CharUnicodeInfo.GetUnicodeCategory(c)
where uc != UnicodeCategory.NonSpacingMark
select c;
var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);
return cleanStr;
}
// or, alternatively
public static string RemoveDiacritics2(this string str)
{
if (null == str) return null;
var chars = str
.Normalize(NormalizationForm.FormD)
.ToCharArray()
.Where(c=> CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
.ToArray();
return new string(chars).Normalize(NormalizationForm.FormC);
}
}
试试helppersharp包。
有一个方法removeaccent:
public static string RemoveAccents(this string source)
{
//8 bit characters
byte[] b = Encoding.GetEncoding(1251).GetBytes(source);
// 7 bit characters
string t = Encoding.ASCII.GetString(b);
Regex re = new Regex("[^a-zA-Z0-9]=-_/");
string c = re.Replace(t, " ");
return c;
}