What is the best way to remove accents (normalize) in a Python unicode string?

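I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found an elegant way to do this on the web (in Java):

1. Convert the Unicode string to its long normalized form (with separate characters for letters and diacritics).
2. Remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about Python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterparts.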

Current answer

There are already many answers here, but one option has not been mentioned before: using sklearn.

from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode

accented_string = u'Málagueña®'

print(strip_accents_unicode(accented_string)) # output: Malaguena®
print(strip_accents_ascii(accented_string)) # output: Malaguena

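This is especially useful if you are already using sklearn to process text. These are the functions that classes such as CountVectorizer call internally to normalize strings: with strip_accents='ascii', strip_accents_ascii is called; with strip_accents='unicode', strip_accents_unicode is called.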
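To see why the two functions give different results for the ® sign, here is a minimal stdlib-only sketch of behaviour that should be roughly equivalent. The helper names are hypothetical, and this is an assumption about what the functions do, not sklearn's actual source.

```python
import unicodedata


def strip_accents_unicode_sketch(s):
    # Hypothetical helper: decompose (NFKD), then drop combining marks only.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def strip_accents_ascii_sketch(s):
    # Hypothetical helper: decompose (NFKD), then drop everything non-ASCII.
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", "ignore").decode("ascii")


accented_string = "M\u00e1lague\u00f1a\u00ae"  # 'Málagueña®'
print(strip_accents_unicode_sketch(accented_string))  # Malaguena®
print(strip_accents_ascii_sketch(accented_string))    # Malaguena
```

The ® sign has no decomposition and is not a combining mark, so the first variant keeps it, while the ASCII variant discards it along with every other non-ASCII character.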

More details

Finally, consider these details from the docstrings:

Signature: strip_accents_ascii(s)
Transform accentuated unicode symbols into ascii or nothing

Warning: this solution is only suited for languages that have a direct
transliteration to ASCII symbols.

and

Signature: strip_accents_unicode(s)
Transform accentuated unicode symbols into their simple counterpart

Warning: the python-level loop and join operations make this
implementation 20 times slower than the strip_accents_ascii basic
normalization.

2022-06-02 12:51:24

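Other answers

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ASCII text.

Example: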

>>> from unidecode import unidecode
>>> unidecode('kožušček')
'kozuscek'
>>> unidecode('北亰')
'Bei Jing '
>>> unidecode('François')
'Francois'

2010-04-13 21:21:14

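If you are hoping to get functionality similar to Elasticsearch's asciifolding filter, you might want to consider fold-to-ascii, which is [by itself]…

A Python port of the Apache Lucene ASCII Folding Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into ASCII equivalents, if they exist.

Here is an example from the page mentioned above: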

from fold_to_ascii import fold
s = u'Astroturf® paté'
fold(s)
> u'Astroturf pate'
fold(s, u'?')
> u'Astroturf? pate'

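EDIT: The fold_to_ascii module seems to work well for normalizing Latin-based alphabets; however, unmappable characters are removed, which means that this module will reduce Chinese text, for example, to empty strings. If you want to preserve Chinese, Japanese, and other Unicode alphabets, consider using @mo-han's remove_accent_chars_regex implementation instead.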
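For readers who prefer not to install the third-party regex module, the same idea as remove_accent_chars_regex can be sketched with the standard library alone, by filtering on the Unicode category "Mn" (nonspacing mark). The helper name below is hypothetical, and the choice of NFKD mirrors the regex version rather than anything fold_to_ascii does.

```python
import unicodedata


def remove_nonspacing_marks(text):
    # Hypothetical helper: decompose, then drop only category "Mn"
    # (nonspacing marks); everything else, including CJK, is kept.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


print(remove_nonspacing_marks("Astroturf\u00ae pat\u00e9"))   # Astroturf® pate
print(remove_nonspacing_marks("pat\u00e9 \u4e2d\u6587"))      # pate 中文
```

Unlike fold, this keeps symbols such as ® and leaves Chinese text untouched, because only the combining accents are removed.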

2021-05-02 10:09:02

I just found this answer on the web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

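It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than by dropping the non-ASCII characters, because that fails for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as diacritics.

Edit: this does the trick: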

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

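unicodedata.combining(c) returns the canonical combining class of c as an integer; it is non-zero (and therefore truthy) when c can be combined with the preceding character, which mainly means it is a diacritic.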
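As a small illustration of the point about Greek, the snippet below inspects what NFKD produces for a single accented letter and then compares the two filtering strategies on a Greek word. The variable names are only for illustration.

```python
import unicodedata

decomposed = unicodedata.normalize("NFKD", "\u00e9")  # precomposed 'é'
# One precomposed character becomes a base letter plus a combining mark.
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
print([unicodedata.combining(c) for c in decomposed])  # [0, 230]

greek = unicodedata.normalize("NFKD", "Ελληνικά")
# Strategy 1: keep only ASCII. Every Greek letter is lost.
ascii_only = greek.encode("ascii", "ignore").decode("ascii")
# Strategy 2: drop only the combining marks. The letters survive.
marks_removed = "".join(c for c in greek if not unicodedata.combining(c))
print(repr(ascii_only))  # ''
print(marks_removed)     # Ελληνικα
```

The base letters have combining class 0 and are kept by the second strategy regardless of script, which is why it works for Greek while the ASCII filter returns an empty string.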

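Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, you must first decode it into a unicode string like this:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
# Python 3 bytes literals must be ASCII-only, so the bytes are written with escapes.
byte_string = b"caf\xc3\xa9"  # UTF-8 bytes of "café"; before Python 3, simply "café"
unicode_string = byte_string.decode(encoding)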

2009-02-05 21:19:34

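Actually, I work on a project that has to be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user entries.

Thanks to you, I have created this function, which works wonders.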

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

Result:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

2015-07-24 10:08:14

import unicodedata
from random import choice

import perfplot
import regex
import text_unidecode


def remove_accent_chars_regex(x: str):
    return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))


def remove_accent_chars_join(x: str):
    # answer by MiniQuark
    # https://stackoverflow.com/a/517974/7966259
    return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])


perfplot.show(
    setup=lambda n: ''.join([choice('Málaga François Phút Hơn 中文') for i in range(n)]),
    kernels=[
        remove_accent_chars_regex,
        remove_accent_chars_join,
        text_unidecode.unidecode,
    ],
    labels=['regex', 'join', 'unidecode'],
    n_range=[2 ** k for k in range(22)],
    equality_check=None, relative_to=0, xlabel='str len'
)

2021-02-03 02:59:45
