What is the best way to remove accents (normalize) in a Python unicode string?

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found an elegant way to do this on the web (in Java):

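1. Convert the Unicode string to its long normalized form (with separate characters for letters and diacritics).
2. Remove all the characters whose Unicode type is "diacritic".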

Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about Python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterparts.

Current answer

Actually, I work on a project that has to stay compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user entries.

Thanks to you, I have created this function, which works wonders.

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

Result:

>>> text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
'montreal_uber_1289_mere_francoise_noel_889'

2015-07-24 10:08:14

Other answers

Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.

Example:

>>> from unidecode import unidecode
>>> unidecode('kožušček')
'kozuscek'
>>> unidecode('北亰')
'Bei Jing '
>>> unidecode('François')
'Francois'

2010-04-13 21:21:14

How about this:

import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

This works on Greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>

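The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it is more explicit).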
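To make the comparison concrete, here is a minimal sketch that puts the two filters side by side. The function names `strip_by_category` and `strip_by_combining` are made up for this illustration; only `unicodedata.normalize`, `unicodedata.category` and `unicodedata.combining` come from the standard library. The two filters agree on the sample from this answer, but they are not guaranteed to agree for every script.

```python
import unicodedata

def strip_by_category(s):
    """Drop characters whose general category is Mn (Nonspacing_Mark)."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

def strip_by_combining(s):
    """Drop characters that have a non-zero canonical combining class."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

# Same sample as above: A, A with grave, Greek Delta, Greek Upsilon with tonos.
sample = "A \u00c0 \u0394 \u038e"
print(strip_by_category(sample) == strip_by_combining(sample))  # both give 'A A \u0394 \u03a5'
```

Both variants rely on NFD splitting a precomposed letter into a base letter plus a combining mark; a character that has no such decomposition passes through unchanged.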

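And keep in mind that these manipulations may significantly alter the meaning of the text. Accents, umlauts and the like are not "decoration".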

2009-02-05 22:17:22

In response to @MiniQuark's answer:

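I was trying to read a CSV file that was half French (containing accents) and that also had some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this: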

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (which I found in a Python ticket), as well as incorporate @Jabba's comment:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

Result:

Montreal
uber
12.89
Mere
Francoise
noel
889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3.)
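For readers on Python 3, a rough equivalent might look like the sketch below. It assumes the input is UTF-8 and uses an in-memory `io.StringIO` as a stand-in for the test.txt file so that the example is self-contained; with a real file you would pass `encoding='utf-8'` to `open()` instead. The `reload(sys)` and `sys.setdefaultencoding` lines have no Python 3 counterpart and are not needed there, because `str` is already Unicode. The `skipinitialspace=True` option is an addition of this sketch, used to drop the space after each comma.

```python
import csv
import io
import unicodedata

def remove_accents(input_str):
    # NFKD splits accented letters into a base letter plus combining marks.
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # Keep only the characters that are not combining marks.
    return ''.join(c for c in nfkd_form if not unicodedata.combining(c))

# Stand-in for: open('test.txt', encoding='utf-8', newline='')
f = io.StringIO("Montréal, über, 12.89, Mère, Françoise, noël, 889\n")

reader = csv.reader(f, skipinitialspace=True)
cleaned = [remove_accents(element) for row in reader for element in row]

for element in cleaned:
    print(element)
```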

2013-06-12 15:48:48

gensim.utils.deaccent(text) from Gensim (topic modelling for humans):

'Sef chomutovskych komunistu dostal postou bily prasek'

Another solution is unidecode.

Note that the suggested solution with unicodedata typically removes accents only from some characters (for example, it turns 'ł' into '' rather than into 'l').
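The limitation can be checked with a small sketch using only the standard library. The helper names `strip_marks` and `ascii_ignore` are made up for this illustration. The letter U+0142 (l with stroke) has no canonical decomposition, so a filter that only drops combining marks leaves it untouched, while the variant that encodes to ASCII and ignores errors drops it entirely.

```python
import unicodedata

def strip_marks(s):
    """Decompose with NFD, then drop combining marks."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def ascii_ignore(s):
    """Decompose with NFD, then silently drop everything that is not ASCII."""
    decomposed = unicodedata.normalize('NFD', s)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

e_acute = '\u00e9'   # e with acute accent: decomposes into 'e' + U+0301
l_stroke = '\u0142'  # l with stroke: has no decomposition

for ch in (e_acute, l_stroke):
    # unicodedata.decomposition returns '' when there is no decomposition.
    print(repr(unicodedata.decomposition(ch)),
          repr(strip_marks(ch)),
          repr(ascii_ignore(ch)))
```

Neither variant produces a plain 'l', which is why a transliteration library is suggested for such letters.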

2018-01-30 00:27:58
