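I'm trying to convert some strings that are in French Canadian. Basically, I'd like to be able to take out the French accent marks from the letters while keeping the letters themselves (e.g. convert é to e, so crème brûlée would become creme brulee).

What is the best way to achieve this?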


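Current answer

C# string extension method

I think the best solution for preserving the meaning of the string is to convert the characters rather than strip them. The example crème brûlée illustrates this well: stripping gives crme brle, converting gives creme brulee.

I checked out Alexander's comment above and saw that the Lucene.Net code is Apache 2.0 licensed, so I modified the class into a simple string extension method. You can use it like this: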

var originalString = "crème brûlée";
var maxLength = originalString.Length; // limit output length as necessary
var foldedString = originalString.FoldToASCII(maxLength); 
// "creme brulee"

The function is too long to post in a StackOverflow answer (~139k characters of the 30k allowed, lol), so I made a gist and attributed the authors:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/// <summary>
/// This class converts alphabetic, numeric, and symbolic Unicode characters
/// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
/// block) into their ASCII equivalents, if one exists.
/// <para/>
/// Characters from the following Unicode blocks are converted; however, only
/// those characters with reasonable ASCII alternatives are converted:
/// 
/// <ul>
///   <item><description>C1 Controls and Latin-1 Supplement: <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a></description></item>
///   <item><description>Latin Extended-A: <a href="http://www.unicode.org/charts/PDF/U0100.pdf">http://www.unicode.org/charts/PDF/U0100.pdf</a></description></item>
///   <item><description>Latin Extended-B: <a href="http://www.unicode.org/charts/PDF/U0180.pdf">http://www.unicode.org/charts/PDF/U0180.pdf</a></description></item>
///   <item><description>Latin Extended Additional: <a href="http://www.unicode.org/charts/PDF/U1E00.pdf">http://www.unicode.org/charts/PDF/U1E00.pdf</a></description></item>
///   <item><description>Latin Extended-C: <a href="http://www.unicode.org/charts/PDF/U2C60.pdf">http://www.unicode.org/charts/PDF/U2C60.pdf</a></description></item>
///   <item><description>Latin Extended-D: <a href="http://www.unicode.org/charts/PDF/UA720.pdf">http://www.unicode.org/charts/PDF/UA720.pdf</a></description></item>
///   <item><description>IPA Extensions: <a href="http://www.unicode.org/charts/PDF/U0250.pdf">http://www.unicode.org/charts/PDF/U0250.pdf</a></description></item>
///   <item><description>Phonetic Extensions: <a href="http://www.unicode.org/charts/PDF/U1D00.pdf">http://www.unicode.org/charts/PDF/U1D00.pdf</a></description></item>
///   <item><description>Phonetic Extensions Supplement: <a href="http://www.unicode.org/charts/PDF/U1D80.pdf">http://www.unicode.org/charts/PDF/U1D80.pdf</a></description></item>
///   <item><description>General Punctuation: <a href="http://www.unicode.org/charts/PDF/U2000.pdf">http://www.unicode.org/charts/PDF/U2000.pdf</a></description></item>
///   <item><description>Superscripts and Subscripts: <a href="http://www.unicode.org/charts/PDF/U2070.pdf">http://www.unicode.org/charts/PDF/U2070.pdf</a></description></item>
///   <item><description>Enclosed Alphanumerics: <a href="http://www.unicode.org/charts/PDF/U2460.pdf">http://www.unicode.org/charts/PDF/U2460.pdf</a></description></item>
///   <item><description>Dingbats: <a href="http://www.unicode.org/charts/PDF/U2700.pdf">http://www.unicode.org/charts/PDF/U2700.pdf</a></description></item>
///   <item><description>Supplemental Punctuation: <a href="http://www.unicode.org/charts/PDF/U2E00.pdf">http://www.unicode.org/charts/PDF/U2E00.pdf</a></description></item>
///   <item><description>Alphabetic Presentation Forms: <a href="http://www.unicode.org/charts/PDF/UFB00.pdf">http://www.unicode.org/charts/PDF/UFB00.pdf</a></description></item>
///   <item><description>Halfwidth and Fullwidth Forms: <a href="http://www.unicode.org/charts/PDF/UFF00.pdf">http://www.unicode.org/charts/PDF/UFF00.pdf</a></description></item>
/// </ul>
/// <para/>
/// See: <a href="http://en.wikipedia.org/wiki/Latin_characters_in_Unicode">http://en.wikipedia.org/wiki/Latin_characters_in_Unicode</a>
/// <para/>
/// For example, '&amp;agrave;' will be replaced by 'a'.
/// </summary>
public static partial class StringExtensions
{
    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents.  For example,
    /// accents are removed from accented characters. 
    /// </summary>
    /// <param name="input">     The string of characters to fold </param>
    /// <param name="length">    The length of the folded return string </param>
    /// <returns> The folded string </returns>
    public static string FoldToASCII(this string input, int? length = null)
    {
        // See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
    }
}

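Hope this helps someone else; this is the most robust solution I've found!

Other answers

I don't have enough reputation, so apparently I can't comment on Alexander's excellent link. Lucene appears to be the only solution that works in reasonably generic cases.

For those who want a simple copy-paste solution, here it is, leveraging code from Lucene: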

string testbed = "ÁÂÄÅÇÉÍÎÓÖØÚÜÞàáâãäåæçèéêëìíîïðñóôöøúüāăčĐęğıŁłńŌōřŞşšźžșțệủ";

Console.WriteLine(Lucene.latinizeLucene(testbed));

AAAACEIIOOOUUTHaaaaaaaeceeeeiiiidnoooouuaacDegiLlnOorSsszzsteu

//////////

public static class Lucene
{
    // source: https://raw.githubusercontent.com/apache/lucenenet/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs
    // idea: https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net (scroll down, search for lucene by Alexander)
    public static string latinizeLucene(string arg)
    {
        char[] argChar = arg.ToCharArray();

        // latinizeLuceneImpl can expand one char up to four chars - e.g. Þ to TH, or æ to ae, or in fact ⑽ to (10)
        char[] resultChar = new String(' ', arg.Length * 4).ToCharArray();

        int outputPos = Lucene.latinizeLuceneImpl(argChar, 0, ref resultChar, 0, arg.Length);

        string ret = new string(resultChar);
        ret = ret.Substring(0, outputPos);

        return ret;
    }

    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents.  For example,
    /// accents are removed from accented characters. 
    /// <para/>
    /// @lucene.internal
    /// </summary>
    /// <param name="input">     The characters to fold </param>
    /// <param name="inputPos">  Index of the first character to fold </param>
    /// <param name="output">    The result of the folding. Should be of size >= <c>length * 4</c>. </param>
    /// <param name="outputPos"> Index of output where to put the result of the folding </param>
    /// <param name="length">    The number of characters to fold </param>
    /// <returns> length of output </returns>
    private static int latinizeLuceneImpl(char[] input, int inputPos, ref char[] output, int outputPos, int length)
    {
        int end = inputPos + length;
        for (int pos = inputPos; pos < end; ++pos)
        {
            char c = input[pos];

            // Quick test: if it's not in range then just keep current character
            if (c < '\u0080')
            {
                output[outputPos++] = c;
            }
            else
            {
                switch (c)
                {
                    case '\u00C0': // À  [LATIN CAPITAL LETTER A WITH GRAVE]
                    case '\u00C1': // Á  [LATIN CAPITAL LETTER A WITH ACUTE]
                    case '\u00C2': // Â  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
                    case '\u00C3': // Ã  [LATIN CAPITAL LETTER A WITH TILDE]
                    case '\u00C4': // Ä  [LATIN CAPITAL LETTER A WITH DIAERESIS]
                    case '\u00C5': // Å  [LATIN CAPITAL LETTER A WITH RING ABOVE]
                    case '\u0100': // Ā  [LATIN CAPITAL LETTER A WITH MACRON]
                    case '\u0102': // Ă  [LATIN CAPITAL LETTER A WITH BREVE]
                    case '\u0104': // Ą  [LATIN CAPITAL LETTER A WITH OGONEK]
                    case '\u018F': // Ə  http://en.wikipedia.org/wiki/Schwa  [LATIN CAPITAL LETTER SCHWA]
                    case '\u01CD': // Ǎ  [LATIN CAPITAL LETTER A WITH CARON]
                    case '\u01DE': // Ǟ  [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
                    case '\u01E0': // Ǡ  [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
                    case '\u01FA': // Ǻ  [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
                    case '\u0200': // Ȁ  [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
                    case '\u0202': // Ȃ  [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
                    case '\u0226': // Ȧ  [LATIN CAPITAL LETTER A WITH DOT ABOVE]
                    case '\u023A': // Ⱥ  [LATIN CAPITAL LETTER A WITH STROKE]
                    case '\u1D00': // ᴀ  [LATIN LETTER SMALL CAPITAL A]
                    case '\u1E00': // Ḁ  [LATIN CAPITAL LETTER A WITH RING BELOW]
                    case '\u1EA0': // Ạ  [LATIN CAPITAL LETTER A WITH DOT BELOW]
                    case '\u1EA2': // Ả  [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
                    case '\u1EA4': // Ấ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
                    case '\u1EA6': // Ầ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
                    case '\u1EA8': // Ẩ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                    case '\u1EAA': // Ẫ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
                    case '\u1EAC': // Ậ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                    case '\u1EAE': // Ắ  [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
                    case '\u1EB0': // Ằ  [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
                    case '\u1EB2': // Ẳ  [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
                    case '\u1EB4': // Ẵ  [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
                    case '\u1EB6': // Ặ  [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
                    case '\u24B6': // Ⓐ  [CIRCLED LATIN CAPITAL LETTER A]
                    case '\uFF21': // A  [FULLWIDTH LATIN CAPITAL LETTER A]
                        output[outputPos++] = 'A';
                        break;
                    case '\u00E0': // à  [LATIN SMALL LETTER A WITH GRAVE]
                    case '\u00E1': // á  [LATIN SMALL LETTER A WITH ACUTE]
                    case '\u00E2': // â  [LATIN SMALL LETTER A WITH CIRCUMFLEX]
                    case '\u00E3': // ã  [LATIN SMALL LETTER A WITH TILDE]
                    case '\u00E4': // ä  [LATIN SMALL LETTER A WITH DIAERESIS]
                    case '\u00E5': // å  [LATIN SMALL LETTER A WITH RING ABOVE]
                    case '\u0101': // ā  [LATIN SMALL LETTER A WITH MACRON]
                    case '\u0103': // ă  [LATIN SMALL LETTER A WITH BREVE]
                    case '\u0105': // ą  [LATIN SMALL LETTER A WITH OGONEK]
                    case '\u01CE': // ǎ  [LATIN SMALL LETTER A WITH CARON]
                    case '\u01DF': // ǟ  [LATIN SMALL LETTER A WITH DIAERESIS AND MACRON]
                    case '\u01E1': // ǡ  [LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON]
                    case '\u01FB': // ǻ  [LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE]
                    case '\u0201': // ȁ  [LATIN SMALL LETTER A WITH DOUBLE GRAVE]
                    case '\u0203': // ȃ  [LATIN SMALL LETTER A WITH INVERTED BREVE]
                    case '\u0227': // ȧ  [LATIN SMALL LETTER A WITH DOT ABOVE]
                    case '\u0250': // ɐ  [LATIN SMALL LETTER TURNED A]
                    case '\u0259': // ə  [LATIN SMALL LETTER SCHWA]
                    case '\u025A': // ɚ  [LATIN SMALL LETTER SCHWA WITH HOOK]
                    case '\u1D8F': // ᶏ  [LATIN SMALL LETTER A WITH RETROFLEX HOOK]
                    case '\u1D95': // ᶕ  [LATIN SMALL LETTER SCHWA WITH RETROFLEX HOOK]
                    case '\u1E01': // ḁ  [LATIN SMALL LETTER A WITH RING BELOW]
                    case '\u1E9A': // ẚ  [LATIN SMALL LETTER A WITH RIGHT HALF RING]
                    case '\u1EA1': // ạ  [LATIN SMALL LETTER A WITH DOT BELOW]
                    case '\u1EA3': // ả  [LATIN SMALL LETTER A WITH HOOK ABOVE]
                    case '\u1EA5': // ấ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE]
                    case '\u1EA7': // ầ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE]
                    case '\u1EA9': // ẩ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                    case '\u1EAB': // ẫ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE]
                    case '\u1EAD': // ậ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                    case '\u1EAF': // ắ  [LATIN SMALL LETTER A WITH BREVE AND ACUTE]
                    case '\u1EB1': // ằ  [LATIN SMALL LETTER A WITH BREVE AND GRAVE]
                    case '\u1EB3': // ẳ  [LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE]
                    case '\u1EB5': // ẵ  [LATIN SMALL LETTER A WITH BREVE AND TILDE]
                    case '\u1EB7': // ặ  [LATIN SMALL LETTER A WITH BREVE AND DOT BELOW]
                    case '\u2090': // ₐ  [LATIN SUBSCRIPT SMALL LETTER A]
                    case '\u2094': // ₔ  [LATIN SUBSCRIPT SMALL LETTER SCHWA]
                    case '\u24D0': // ⓐ  [CIRCLED LATIN SMALL LETTER A]
                    case '\u2C65': // ⱥ  [LATIN SMALL LETTER A WITH STROKE]
                    case '\u2C6F': // Ɐ  [LATIN CAPITAL LETTER TURNED A]
                    case '\uFF41': // a  [FULLWIDTH LATIN SMALL LETTER A]
                        output[outputPos++] = 'a';
                        break;
                    case '\uA732': // Ꜳ  [LATIN CAPITAL LETTER AA]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'A';
                        break;
                    case '\u00C6': // Æ  [LATIN CAPITAL LETTER AE]
                    case '\u01E2': // Ǣ  [LATIN CAPITAL LETTER AE WITH MACRON]
                    case '\u01FC': // Ǽ  [LATIN CAPITAL LETTER AE WITH ACUTE]
                    case '\u1D01': // ᴁ  [LATIN LETTER SMALL CAPITAL AE]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'E';
                        break;
                    case '\uA734': // Ꜵ  [LATIN CAPITAL LETTER AO]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'O';
                        break;
                    case '\uA736': // Ꜷ  [LATIN CAPITAL LETTER AU]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'U';
                        break;

        // etc. etc. etc.
        // see link above for complete source code
        // 
        // unfortunately, postings are limited, as in
        // "Body is limited to 30000 characters; you entered 136098."

                    [...]

                    case '\u2053': // ⁓  [SWUNG DASH]
                    case '\uFF5E': // ~  [FULLWIDTH TILDE]
                        output[outputPos++] = '~';
                        break;
                    default:
                        output[outputPos++] = c;
                        break;
                }
            }
        }
        return outputPos;
    }
}

I needed something that converts all major Unicode characters, and the voted answer left a few out, so I created a C# version of CodeIgniter's convert_accented_characters($str) that is easy to customise:

using System;
using System.Text;
using System.Collections.Generic;

public static class Strings
{
    static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
    {
        { "äæǽ", "ae" },
        { "öœ", "oe" },
        { "ü", "ue" },
        { "Ä", "Ae" },
        { "Ü", "Ue" },
        { "Ö", "Oe" },
        { "ÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶА", "A" },
        { "àáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặа", "a" },
        { "Б", "B" },
        { "б", "b" },
        { "ÇĆĈĊČ", "C" },
        { "çćĉċč", "c" },
        { "Д", "D" },
        { "д", "d" },
        { "ÐĎĐΔ", "Dj" },
        { "ðďđδ", "dj" },
        { "ÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭ", "E" },
        { "èéêëēĕėęěέεẽẻẹềếễểệеэ", "e" },
        { "Ф", "F" },
        { "ф", "f" },
        { "ĜĞĠĢΓГҐ", "G" },
        { "ĝğġģγгґ", "g" },
        { "ĤĦ", "H" },
        { "ĥħ", "h" },
        { "ÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫ", "I" },
        { "ìíîïĩīĭǐįıηήίιϊỉịиыї", "i" },
        { "Ĵ", "J" },
        { "ĵ", "j" },
        { "ĶΚК", "K" },
        { "ķκк", "k" },
        { "ĹĻĽĿŁΛЛ", "L" },
        { "ĺļľŀłλл", "l" },
        { "М", "M" },
        { "м", "m" },
        { "ÑŃŅŇΝН", "N" },
        { "ñńņňŉνн", "n" },
        { "ÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢО", "O" },
        { "òóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợо", "o" },
        { "П", "P" },
        { "п", "p" },
        { "ŔŖŘΡР", "R" },
        { "ŕŗřρр", "r" },
        { "ŚŜŞȘŠΣС", "S" },
        { "śŝşșšſσςс", "s" },
        { "ȚŢŤŦτТ", "T" },
        { "țţťŧт", "t" },
        { "ÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУ", "U" },
        { "ùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựу", "u" },
        { "ÝŸŶΥΎΫỲỸỶỴЙ", "Y" },
        { "ýÿŷỳỹỷỵй", "y" },
        { "В", "V" },
        { "в", "v" },
        { "Ŵ", "W" },
        { "ŵ", "w" },
        { "ŹŻŽΖЗ", "Z" },
        { "źżžζз", "z" },
        { "ÆǼ", "AE" },
        { "ß", "ss" },
        { "Ĳ", "IJ" },
        { "ĳ", "ij" },
        { "Œ", "OE" },
        { "ƒ", "f" },
        { "ξ", "ks" },
        { "π", "p" },
        { "β", "v" },
        { "μ", "m" },
        { "ψ", "ps" },
        { "Ё", "Yo" },
        { "ё", "yo" },
        { "Є", "Ye" },
        { "є", "ye" },
        { "Ї", "Yi" },
        { "Ж", "Zh" },
        { "ж", "zh" },
        { "Х", "Kh" },
        { "х", "kh" },
        { "Ц", "Ts" },
        { "ц", "ts" },
        { "Ч", "Ch" },
        { "ч", "ch" },
        { "Ш", "Sh" },
        { "ш", "sh" },
        { "Щ", "Shch" },
        { "щ", "shch" },
        { "ЪъЬь", "" },
        { "Ю", "Yu" },
        { "ю", "yu" },
        { "Я", "Ya" },
        { "я", "ya" },
    };

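    public static char RemoveDiacritics(this char c){
        foreach(KeyValuePair<string, string> entry in foreign_characters)
        {
            if(entry.Key.IndexOf (c) != -1)
            {
                // A char can only carry the first character of the replacement.
                // Entries mapped to "" (the Cyrillic hard/soft signs) leave c unchanged
                // instead of throwing on Value[0].
                return entry.Value.Length > 0 ? entry.Value[0] : c;
            }
        }
        return c;
    }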

    public static string RemoveDiacritics(this string s) 
    {
        StringBuilder sb = new StringBuilder (s.Length);

        foreach (char c in s)
        {
            // Track whether a table entry matched, so that entries mapped to ""
            // really drop the character instead of falling through to the copy below.
            bool replaced = false;

            foreach(KeyValuePair<string, string> entry in foreign_characters)
            {
                if(entry.Key.IndexOf (c) != -1)
                {
                    sb.Append (entry.Value);
                    replaced = true;
                    break;
                }
            }

            if (!replaced) {
                sb.Append (c);
            }
        }
        return sb.ToString ();
    }
}

Usage

// for strings
"crème brûlée".RemoveDiacritics (); // creme brulee

// for chars
"Ã"[0].RemoveDiacritics (); // A
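The table above is scanned group by group for every input character. If that ever becomes a bottleneck, one option is to flatten the groups into a per-character dictionary once and look each character up directly. The following is only a sketch under that assumption: the class name CharLookupSketch, the method Fold, and the abbreviated table are all made up for illustration and are not part of the answer's code.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharLookupSketch
{
    // Hypothetical, abbreviated table in the same "group of characters -> replacement" shape.
    static readonly Dictionary<string, string> Groups = new Dictionary<string, string>
    {
        { "\u00E4\u00E6", "ae" },            // ä æ
        { "\u00E8\u00E9\u00EA\u00EB", "e" }, // è é ê ë
        { "\u00F9\u00FA\u00FB", "u" },       // ù ú û
        { "\u00DF", "ss" },                  // ß
    };

    // Built once, after Groups (static fields initialise in textual order).
    static readonly Dictionary<char, string> PerChar = BuildPerChar();

    static Dictionary<char, string> BuildPerChar()
    {
        var map = new Dictionary<char, string>();
        foreach (KeyValuePair<string, string> entry in Groups)
        {
            foreach (char c in entry.Key)
            {
                // Keep the first replacement seen for a character.
                if (!map.ContainsKey(c)) map[c] = entry.Value;
            }
        }
        return map;
    }

    public static string Fold(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (PerChar.TryGetValue(c, out string replacement)) sb.Append(replacement);
            else sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Fold("cr\u00E8me br\u00FBl\u00E9e")); // creme brulee
        Console.WriteLine(Fold("Stra\u00DFe"));                 // Strasse
    }
}
```

The grouped table stays the thing you edit; the per-character map is derived from it, so customising works the same way as before.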

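To simply remove French Canadian accent marks, as the original question asked, here is an alternative that uses a regular expression instead of hard-coded conversions and For/Next loops. Depending on your needs, it can be condensed into a single line of code; however, I added it to an extensions class for easier reuse.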

Visual Basic

Imports System.Text
Imports System.Text.RegularExpressions

Public MustInherit Class StringExtension
    Public Shared Function RemoveDiacritics(Text As String) As String
        Return New Regex("\p{Mn}", RegexOptions.Compiled).Replace(Text.Normalize(NormalizationForm.FormD), String.Empty)
    End Function
End Class

Implementation

    Private Shared Sub DoStuff()
        MsgBox(StringExtension.RemoveDiacritics(inputString))
    End Sub

C#

using System.Text;
using System.Text.RegularExpressions;

namespace YourApplication
{
    public abstract class StringExtension
    {
        public static string RemoveDiacritics(string Text)
        {
            return new Regex(@"\p{Mn}", RegexOptions.Compiled).Replace(Text.Normalize(NormalizationForm.FormD), string.Empty);
        }
    }
}

Implementation

        private static void DoStuff()
        {
            MessageBox.Show(StringExtension.RemoveDiacritics(inputString));
        }

Input: äáčďěéíľľňôóřŕšťúůýž ÄÁČĎĚÉÍĽĽŇÔŘŔŠŤÚŮÝŽ ÖÜË łŁđĐ țŢşŞçÇ øı

Output: aacdeeillnoorrstuuyz AACDEEILLNOORRSTUUYZ OUE łŁđĐ tTsScC øı

I included characters that cannot be converted, to help visualize what happens when unexpected input is received.
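The characters that survive are the ones Unicode does not decompose into a base letter plus a combining mark, so there is no \p{Mn} character for the regex to remove. A quick way to check any character is to compare its length before and after FormD normalization. This is a small sketch; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class DecompositionCheck
{
    // True when FormD splits the text into more UTF-16 code units,
    // i.e. a base letter followed by one or more combining marks.
    public static bool Decomposes(string ch) =>
        ch.Normalize(NormalizationForm.FormD).Length > ch.Length;

    public static void Main()
    {
        Console.WriteLine(Decomposes("\u00E9")); // é: True
        Console.WriteLine(Decomposes("\u0142")); // ł: False
        Console.WriteLine(Decomposes("\u00F8")); // ø: False
        Console.WriteLine(Decomposes("\u0111")); // đ: False
    }
}
```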

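If you also need it to convert other kinds of characters, such as the Polish ł and Ł, then depending on your needs, consider incorporating this answer (.NET Core friendly), which uses CodePagesEncodingProvider, into your solution.

This code worked for me: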

var updatedText = text.Normalize(NormalizationForm.FormD)
     .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
     .ToArray();
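Note that the snippet above leaves you with a char[] rather than a string. A fuller version might look like the sketch below, which wraps the same pipeline in an extension method, builds a string from the result, and recomposes it to FormC at the end. The class name NormalizeSketch and the method name RemoveNonSpacingMarks are my own, not from the answer.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class NormalizeSketch
{
    public static string RemoveNonSpacingMarks(this string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        // Decompose (é becomes e + U+0301), then keep everything except combining marks.
        char[] kept = text.Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();

        // Recompose whatever is left so the result is in the usual NFC form.
        return new string(kept).Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        Console.WriteLine("cr\u00E8me br\u00FBl\u00E9e".RemoveNonSpacingMarks()); // creme brulee
    }
}
```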

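However, please don't do this for names. Not only is it an insult to people with umlauts or accents in their name, it can also be dangerously wrong in certain situations (see below). There are alternative spellings that do more than just remove the accent.

Furthermore, it is simply wrong and dangerous if, for example, the user has to provide their name exactly as it appears in their passport.

For example, my name is written Zuberbühler, and in the machine-readable part of my passport you will find Zuberbuehler. With the umlaut simply removed, the name matches neither part. This can lead to problems for the user.

You should instead disallow umlauts and accents in the input form for names, so the user can write their name correctly without the umlaut or accent.

A practical example: if the web service for applying for ESTA (https://www.application-esta.co.uk/special-characters-and) used the above code instead of transliterating umlauts correctly, the ESTA application would either be refused or the traveller would have problems with US border control when entering the States.

Another example is flight tickets. Suppose you have a flight-booking web application, the user provides their name with an accent, and your implementation just removes the accents and then books the ticket through the airline's web service. Your customer may not be allowed to board, because the name does not match any part of their passport.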
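To make the difference concrete, here is a minimal sketch of transliterating German umlauts to their two-letter spellings instead of dropping the dots. The mapping is a small illustrative subset that I chose for the example; it is not an official passport or airline table, and the class and method names are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class NameTransliterationSketch
{
    // Illustrative subset only (assumption): German umlauts and sharp s.
    static readonly Dictionary<char, string> Expansions = new Dictionary<char, string>
    {
        { '\u00E4', "ae" }, { '\u00F6', "oe" }, { '\u00FC', "ue" }, // ä ö ü
        { '\u00C4', "Ae" }, { '\u00D6', "Oe" }, { '\u00DC', "Ue" }, // Ä Ö Ü
        { '\u00DF', "ss" },                                         // ß
    };

    public static string Transliterate(string name)
    {
        // Compose first so that "u + combining diaeresis" is seen as the single char ü.
        string composed = name.Normalize(NormalizationForm.FormC);
        var sb = new StringBuilder(composed.Length + 4);
        foreach (char c in composed)
        {
            sb.Append(Expansions.TryGetValue(c, out string s) ? s : c.ToString());
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Transliterate("Zuberb\u00FChler")); // Zuberbuehler
    }
}
```

Plain accent removal would give Zuberbuhler here, which is exactly the spelling that matches neither part of the passport.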

I really like the concise and functional code provided by azrafe7. So I changed it a little to turn it into an extension method:

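using System.Text;

public static class StringExtensions
{
    public static string RemoveDiacritics(this string text)
    {
        // ISO-8859-8 is the Latin/Hebrew code page. The trick relies on the encoder
        // falling back to the closest ASCII letter for accented Latin characters.
        const string SINGLEBYTE_LATIN_ASCII_ENCODING = "ISO-8859-8";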

        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        return Encoding.ASCII.GetString(
            Encoding.GetEncoding(SINGLEBYTE_LATIN_ASCII_ENCODING).GetBytes(text));
    }
}
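On .NET Core and later, code-page encodings such as ISO-8859-8 are not available by default, so Encoding.GetEncoding would throw unless CodePagesEncodingProvider is registered first. The sketch below shows one way to do that, assuming the System.Text.Encoding.CodePages types are available in your target framework. The class and method names are made up, and I have not verified that the fallback output is identical across runtimes, so test it with your own input.

```csharp
using System;
using System.Text;

public static class CodePagesSketch
{
    static CodePagesSketch()
    {
        // Makes legacy code pages (including ISO-8859-8) resolvable on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    public static string RemoveDiacriticsViaCodePage(this string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        // Encode to the single-byte code page, then read the bytes back as ASCII.
        byte[] bytes = Encoding.GetEncoding("ISO-8859-8").GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    public static void Main()
    {
        Console.WriteLine("cr\u00E8me br\u00FBl\u00E9e".RemoveDiacriticsViaCodePage());
    }
}
```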