我试图转换一些字符串,在法国加拿大,基本上,我想能够拿出法国重音标记在字母,同时保持字母。(例如,将é转换为e,那么crème brûlée就会变成creme brulee)
实现这一目标的最佳方法是什么?
我试图转换一些字符串,在法国加拿大,基本上,我想能够拿出法国重音标记在字母,同时保持字母。(例如,将é转换为e,那么crème brûlée就会变成creme brulee)
实现这一目标的最佳方法是什么?
当前回答
为了像最初的问题一样简单地删除法语加拿大重音标记,这里有一个使用正则表达式而不是硬编码转换和For/Next循环的替代方法。根据您的需要,它可以被压缩成一行代码;但是,我将它添加到一个扩展类中,以便于重用。
Visual Basic
Imports System.Text
Imports System.Text.RegularExpressions
Public MustInherit Class StringExtension
Public Shared Function RemoveDiacritics(Text As String) As String
Return New Regex("\p{Mn}", RegexOptions.Compiled).Replace(Text.Normalize(NormalizationForm.FormD), String.Empty)
End Function
End Class
实现
Private Shared Sub DoStuff()
MsgBox(StringExtension.RemoveDiacritics(inputString))
End Sub
c#
using System.Text;
using System.Text.RegularExpressions;
namespace YourApplication
{
public abstract class StringExtension
{
public static string RemoveDiacritics(string Text)
{
return new Regex(@"\p{Mn}", RegexOptions.Compiled).Replace(Text.Normalize(NormalizationForm.FormD), string.Empty);
}
}
}
实现
private static void DoStuff()
{
MessageBox.Show(StringExtension.RemoveDiacritics(inputString));
}
Input: äáčďěéíľľňôóřŕšťúůýž ÄÁČĎĚÉÍĽĽŇÔŘŔŠŤÚŮÝŽ ÖÜË łŁđĐ țŢşŞçÇ øı
Output: aacdeeillnoorrstuuyz AACDEEILLNOORRSTUUYZ OUE łŁđĐ tTsScC øı
我加入了无法转换的字符,以帮助可视化接收到意外输入时会发生什么。
如果您还需要它来转换其他类型的字符,如波兰语的warsaw和Ł,那么根据您的需要,可以考虑合并这个答案(。NET Core友好),它使用CodePagesEncodingProvider到您的解决方案中。
其他回答
Imports System.Text
Imports System.Globalization
Public Function DECODE(ByVal x As String) As String
Dim sb As New StringBuilder
For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)
sb.Append(c)
Next
Return sb.ToString()
End Function
希腊代码页(ISO)可以做到这一点
关于这个代码页的信息在System.Text.Encoding.GetEncodings()中。了解网址:https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx
希腊语(ISO)的代码页为28597,名称为ISO -8859-7。
进入代码…\ o /
string text = "Você está numa situação lamentável";
string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
//result: "Voce+esta+numa+situacao+lamentavel"
string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
//result: "Voce esta numa situacao lamentavel"
那么,写这个函数…
public string RemoveAcentuation(string text)
{
return
System.Web.HttpUtility.UrlDecode(
System.Web.HttpUtility.UrlEncode(
text, Encoding.GetEncoding("iso-8859-7")));
}
请注意,…Encoding. getencoding ("iso-8859-7")等价于Encoding. getencoding(28597),因为第一个是Encoding的名称,第二个是Encoding的编码页。
这是VB版本(工作与希腊):
导入系统。文本
导入系统。全球化
Public Function RemoveDiacritics(ByVal s As String)
Dim normalizedString As String
Dim stringBuilder As New StringBuilder
normalizedString = s.Normalize(NormalizationForm.FormD)
Dim i As Integer
Dim c As Char
For i = 0 To normalizedString.Length - 1
c = normalizedString(i)
If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
stringBuilder.Append(c)
End If
Next
Return stringBuilder.ToString()
End Function
在这里弹出这个库,如果您还没有考虑过的话。看起来有一个完整的单元测试。
https://github.com/thomasgalliker/Diacritics.NET
c#字符串扩展方法
我认为保留字符串含义的最佳解决方案是转换字符,而不是剥离它们,示例crème brûlée很好地说明了这一点,即crme brle vs. creme brulee。
我查看了上面Alexander的评论,看到了Lucene。Net代码是Apache 2.0许可的,因此我将该类修改为一个简单的字符串扩展方法。你可以这样使用它:
var originalString = "crème brûlée";
var maxLength = originalString.Length; // limit output length as necessary
var foldedString = originalString.FoldToASCII(maxLength);
// "creme brulee"
这个函数太长了,不能在StackOverflow的答案中发布(~139k字符的30k允许lol),所以我做了一个要点,并将作者的名字归为:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/// <summary>
/// This class converts alphabetic, numeric, and symbolic Unicode characters
/// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
/// block) into their ASCII equivalents, if one exists.
/// <para/>
/// Characters from the following Unicode blocks are converted; however, only
/// those characters with reasonable ASCII alternatives are converted:
///
/// <ul>
/// <item><description>C1 Controls and Latin-1 Supplement: <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a></description></item>
/// <item><description>Latin Extended-A: <a href="http://www.unicode.org/charts/PDF/U0100.pdf">http://www.unicode.org/charts/PDF/U0100.pdf</a></description></item>
/// <item><description>Latin Extended-B: <a href="http://www.unicode.org/charts/PDF/U0180.pdf">http://www.unicode.org/charts/PDF/U0180.pdf</a></description></item>
/// <item><description>Latin Extended Additional: <a href="http://www.unicode.org/charts/PDF/U1E00.pdf">http://www.unicode.org/charts/PDF/U1E00.pdf</a></description></item>
/// <item><description>Latin Extended-C: <a href="http://www.unicode.org/charts/PDF/U2C60.pdf">http://www.unicode.org/charts/PDF/U2C60.pdf</a></description></item>
/// <item><description>Latin Extended-D: <a href="http://www.unicode.org/charts/PDF/UA720.pdf">http://www.unicode.org/charts/PDF/UA720.pdf</a></description></item>
/// <item><description>IPA Extensions: <a href="http://www.unicode.org/charts/PDF/U0250.pdf">http://www.unicode.org/charts/PDF/U0250.pdf</a></description></item>
/// <item><description>Phonetic Extensions: <a href="http://www.unicode.org/charts/PDF/U1D00.pdf">http://www.unicode.org/charts/PDF/U1D00.pdf</a></description></item>
/// <item><description>Phonetic Extensions Supplement: <a href="http://www.unicode.org/charts/PDF/U1D80.pdf">http://www.unicode.org/charts/PDF/U1D80.pdf</a></description></item>
/// <item><description>General Punctuation: <a href="http://www.unicode.org/charts/PDF/U2000.pdf">http://www.unicode.org/charts/PDF/U2000.pdf</a></description></item>
/// <item><description>Superscripts and Subscripts: <a href="http://www.unicode.org/charts/PDF/U2070.pdf">http://www.unicode.org/charts/PDF/U2070.pdf</a></description></item>
/// <item><description>Enclosed Alphanumerics: <a href="http://www.unicode.org/charts/PDF/U2460.pdf">http://www.unicode.org/charts/PDF/U2460.pdf</a></description></item>
/// <item><description>Dingbats: <a href="http://www.unicode.org/charts/PDF/U2700.pdf">http://www.unicode.org/charts/PDF/U2700.pdf</a></description></item>
/// <item><description>Supplemental Punctuation: <a href="http://www.unicode.org/charts/PDF/U2E00.pdf">http://www.unicode.org/charts/PDF/U2E00.pdf</a></description></item>
/// <item><description>Alphabetic Presentation Forms: <a href="http://www.unicode.org/charts/PDF/UFB00.pdf">http://www.unicode.org/charts/PDF/UFB00.pdf</a></description></item>
/// <item><description>Halfwidth and Fullwidth Forms: <a href="http://www.unicode.org/charts/PDF/UFF00.pdf">http://www.unicode.org/charts/PDF/UFF00.pdf</a></description></item>
/// </ul>
/// <para/>
/// See: <a href="http://en.wikipedia.org/wiki/Latin_characters_in_Unicode">http://en.wikipedia.org/wiki/Latin_characters_in_Unicode</a>
/// <para/>
/// For example, '&agrave;' will be replaced by 'a'.
/// </summary>
public static partial class StringExtensions
{
/// <summary>
/// Converts characters above ASCII to their ASCII equivalents. For example,
/// accents are removed from accented characters.
/// </summary>
/// <param name="input"> The string of characters to fold </param>
/// <param name="length"> The length of the folded return string </param>
/// <returns> length of output </returns>
public static string FoldToASCII(this string input, int? length = null)
{
// See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
}
}
希望这能帮助到其他人,这是我发现的最强大的解决方案!