我试图转换一些字符串,在法国加拿大,基本上,我想能够拿出法国重音标记在字母,同时保持字母。(例如,将é转换为e,那么crème brûlée就会变成creme brulee)

实现这一目标的最佳方法是什么?


当前回答

这个人说:

Encoding.ASCII.GetString(Encoding.GetEncoding(1251).获取字节(文本));

它实际上把å这样的一个字符(它是字符代码00E5,而不是0061加上修饰符030A,看起来是一样的)分割成一个加上某种修饰符,然后ASCII转换删除修饰符,只留下a。

其他回答

c#字符串扩展方法

我认为保留字符串含义的最佳解决方案是转换字符,而不是剥离它们,示例crème brûlée很好地说明了这一点,即crme brle vs. creme brulee。

我查看了上面Alexander的评论,看到了Lucene。Net代码是Apache 2.0许可的,因此我将该类修改为一个简单的字符串扩展方法。你可以这样使用它:

var originalString = "crème brûlée";
var maxLength = originalString.Length; // limit output length as necessary
var foldedString = originalString.FoldToASCII(maxLength); 
// "creme brulee"

这个函数太长了,不能在StackOverflow的答案中发布(~139k字符的30k允许lol),所以我做了一个要点,并将作者的名字归为:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/// <summary>
/// This class converts alphabetic, numeric, and symbolic Unicode characters
/// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
/// block) into their ASCII equivalents, if one exists.
/// <para/>
/// Characters from the following Unicode blocks are converted; however, only
/// those characters with reasonable ASCII alternatives are converted:
/// 
/// <ul>
///   <item><description>C1 Controls and Latin-1 Supplement: <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a></description></item>
///   <item><description>Latin Extended-A: <a href="http://www.unicode.org/charts/PDF/U0100.pdf">http://www.unicode.org/charts/PDF/U0100.pdf</a></description></item>
///   <item><description>Latin Extended-B: <a href="http://www.unicode.org/charts/PDF/U0180.pdf">http://www.unicode.org/charts/PDF/U0180.pdf</a></description></item>
///   <item><description>Latin Extended Additional: <a href="http://www.unicode.org/charts/PDF/U1E00.pdf">http://www.unicode.org/charts/PDF/U1E00.pdf</a></description></item>
///   <item><description>Latin Extended-C: <a href="http://www.unicode.org/charts/PDF/U2C60.pdf">http://www.unicode.org/charts/PDF/U2C60.pdf</a></description></item>
///   <item><description>Latin Extended-D: <a href="http://www.unicode.org/charts/PDF/UA720.pdf">http://www.unicode.org/charts/PDF/UA720.pdf</a></description></item>
///   <item><description>IPA Extensions: <a href="http://www.unicode.org/charts/PDF/U0250.pdf">http://www.unicode.org/charts/PDF/U0250.pdf</a></description></item>
///   <item><description>Phonetic Extensions: <a href="http://www.unicode.org/charts/PDF/U1D00.pdf">http://www.unicode.org/charts/PDF/U1D00.pdf</a></description></item>
///   <item><description>Phonetic Extensions Supplement: <a href="http://www.unicode.org/charts/PDF/U1D80.pdf">http://www.unicode.org/charts/PDF/U1D80.pdf</a></description></item>
///   <item><description>General Punctuation: <a href="http://www.unicode.org/charts/PDF/U2000.pdf">http://www.unicode.org/charts/PDF/U2000.pdf</a></description></item>
///   <item><description>Superscripts and Subscripts: <a href="http://www.unicode.org/charts/PDF/U2070.pdf">http://www.unicode.org/charts/PDF/U2070.pdf</a></description></item>
///   <item><description>Enclosed Alphanumerics: <a href="http://www.unicode.org/charts/PDF/U2460.pdf">http://www.unicode.org/charts/PDF/U2460.pdf</a></description></item>
///   <item><description>Dingbats: <a href="http://www.unicode.org/charts/PDF/U2700.pdf">http://www.unicode.org/charts/PDF/U2700.pdf</a></description></item>
///   <item><description>Supplemental Punctuation: <a href="http://www.unicode.org/charts/PDF/U2E00.pdf">http://www.unicode.org/charts/PDF/U2E00.pdf</a></description></item>
///   <item><description>Alphabetic Presentation Forms: <a href="http://www.unicode.org/charts/PDF/UFB00.pdf">http://www.unicode.org/charts/PDF/UFB00.pdf</a></description></item>
///   <item><description>Halfwidth and Fullwidth Forms: <a href="http://www.unicode.org/charts/PDF/UFF00.pdf">http://www.unicode.org/charts/PDF/UFF00.pdf</a></description></item>
/// </ul>
/// <para/>
/// See: <a href="http://en.wikipedia.org/wiki/Latin_characters_in_Unicode">http://en.wikipedia.org/wiki/Latin_characters_in_Unicode</a>
/// <para/>
/// For example, '&amp;agrave;' will be replaced by 'a'.
/// </summary>
public static partial class StringExtensions
{
    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents.  For example,
    /// accents are removed from accented characters. 
    /// </summary>
    /// <param name="input">     The string of characters to fold </param>
    /// <param name="length">    The length of the folded return string </param>
    /// <returns> length of output </returns>
    public static string FoldToASCII(this string input, int? length = null)
    {
        // See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
    }
}

希望这能帮助到其他人,这是我发现的最强大的解决方案!

Imports System.Text
Imports System.Globalization

 Public Function DECODE(ByVal x As String) As String
        Dim sb As New StringBuilder
        For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)  
            sb.Append(c)
        Next
        Return sb.ToString()
    End Function

为了像最初的问题一样简单地删除法语加拿大重音标记,这里有一个使用正则表达式而不是硬编码转换和For/Next循环的替代方法。根据您的需要,它可以被压缩成一行代码;但是,我将它添加到一个扩展类中,以便于重用。

Visual Basic

Imports System.Text
Imports System.Text.RegularExpressions

Public MustInherit Class StringExtension
    Public Shared Function RemoveDiacritics(Text As String) As String
        Return New Regex("\p{Mn}", RegexOptions.Compiled).Replace(Text.Normalize(NormalizationForm.FormD), String.Empty)
    End Function
End Class

实现

    Private Shared Sub DoStuff()
        MsgBox(StringExtension.RemoveDiacritics(inputString))
    End Sub

c#

using System.Text;
using System.Text.RegularExpressions;

namespace YourApplication
{
    public abstract class StringExtension
    {
        public static string RemoveDiacritics(string Text)
        {
            return new Regex(@"\p{Mn}", RegexOptions.Compiled).Replace(Text.Normalize(NormalizationForm.FormD), string.Empty);
        }
    }
}

实现

        private static void DoStuff()
        {
            MessageBox.Show(StringExtension.RemoveDiacritics(inputString));
        }

Input: äáčďěéíľľňôóřŕšťúůýž ÄÁČĎĚÉÍĽĽŇÔŘŔŠŤÚŮÝŽ ÖÜË łŁđĐ țŢşŞçÇ øı

Output: aacdeeillnoorrstuuyz AACDEEILLNOORRSTUUYZ OUE łŁđĐ tTsScC øı

我加入了无法转换的字符,以帮助可视化接收到意外输入时会发生什么。

如果您还需要它来转换其他类型的字符,如波兰语的warsaw和Ł,那么根据您的需要,可以考虑合并这个答案(。NET Core友好),它使用CodePagesEncodingProvider到您的解决方案中。

我没有使用过这种方法,但是Michael Kaplan在他的博客文章(有一个令人困惑的标题)中描述了一种方法,谈论剥离变音符:剥离是一项有趣的工作(又名剥离) 论无意义的意义,即一切 Mn字符是非空格的,但是 有些更非间距比 其他人)

static string RemoveDiacritics(string text) 
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder(capacity: normalizedString.Length);

    for (int i = 0; i < normalizedString.Length; i++)
    {
        char c = normalizedString[i];
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder
        .ToString()
        .Normalize(NormalizationForm.FormC);
}

请注意,这是他之前帖子的后续:剥离变音符....

该方法使用String。Normalize将输入字符串分割为组成符号(基本上是将“基本”字符与变音符符分开),然后扫描结果并仅保留基本字符。这只是有点复杂,但实际上你看到的是一个复杂的问题。

当然,如果你限制自己使用法语,你可能会使用@David Dibben推荐的如何在c++ std::string中删除重音和波浪号的简单基于表的方法。

与接受的答案相同,但更快,使用Span而不是StringBuilder。 需要。net Core 3.1或更新的。net。

static string RemoveDiacritics(string text) 
{
    ReadOnlySpan<char> normalizedString = text.Normalize(NormalizationForm.FormD);
    int i = 0;
    Span<char> span = text.Length < 1000
        ? stackalloc char[text.Length]
        : new char[text.Length];

    foreach (char c in normalizedString)
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            span[i++] = c;
    }

    return new string(span).Normalize(NormalizationForm.FormC);
}

此外,这是可扩展的额外字符替换,如抛光Ł。

span[i++] = c switch
{
    'Ł' => 'L',
    'ł' => 'l',
    _ => c
};

一个小提示:堆栈分配stackalloc比堆分配new要快得多,它为垃圾收集器减少了工作。1000是一个阈值,以避免在堆栈上分配大结构,这可能会导致StackOverflowException。虽然1000是一个相当安全的值,但在大多数情况下10000甚至100000也可以(100k在堆栈上分配最多200kB,而默认堆栈大小为1mb)。然而10万对我来说有点危险。