Good Python modules for fuzzy string comparison?

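I'm looking for a Python module that can do simple fuzzy string comparison. Specifically, I'd like a percentage of how similar two strings are. I know this is potentially subjective, so I was hoping to find a library that can do positional comparisons as well as longest similar string matches, among other things.

Basically, I'm hoping to find something simple enough to yield a single percentage, while still being configurable enough that I can specify which type of comparison to do.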

Current answer

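Jellyfish is a Python module that supports many string comparison metrics, including phonetic matching. Pure Python implementations of Levenshtein edit distance are quite slow compared to Jellyfish's implementation.

Example usage: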

>>> import jellyfish

>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2
>>> jellyfish.jaro_distance('jellyfish', 'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
1
>>> jellyfish.metaphone('Jellyfish')
'JLFX'
>>> jellyfish.soundex('Jellyfish')
'J412'
>>> jellyfish.nysiis('Jellyfish')
'JALYF'
>>> jellyfish.match_rating_codex('Jellyfish')
'JLLFSH'
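
For readers who want to see what the edit distance above actually counts, here is a minimal pure Python sketch of Levenshtein distance using the usual two-row dynamic programming table. The function names `levenshtein_distance` and `levenshtein_percent` are my own, and dividing by the longer string's length is just one common way to turn the distance into a percentage, not something Jellyfish does for you. As noted above, a pure Python version like this will be much slower than a C-backed library.

```python
def levenshtein_distance(a, b):
    """Count the single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def levenshtein_percent(a, b):
    """Hypothetical helper: similarity as a percentage of the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein_distance(a, b) / longest)


print(levenshtein_distance('jellyfish', 'smellyfish'))  # 2
print(levenshtein_percent('jellyfish', 'smellyfish'))
```

The distance for 'jellyfish' and 'smellyfish' agrees with the Jellyfish output above: one insertion ('s') and one substitution ('j' to 'm').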

2011-12-03 19:20:23

Other answers

Google also has its own google-diff-match-patch ("currently available in Java, JavaScript, C++ and Python").

(I can't comment on it, since I've only used Python's difflib myself.)

2009-03-25 17:47:33

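As nosklo said, use the difflib module from the Python standard library.

The difflib module can return a measure of the similarity of two sequences using the ratio() method of a SequenceMatcher() object. The similarity is returned as a float in the range 0.0 to 1.0.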

>>> import difflib

>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0

>>> difflib.SequenceMatcher(None, 'abcde', 'zbcde').ratio()
0.80000000000000004

>>> difflib.SequenceMatcher(None, 'abcde', 'zyzzy').ratio()
0.0
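
The question asks for a single percentage with a selectable comparison type, so here is a rough sketch of how that could be layered on top of difflib. Everything except `difflib.SequenceMatcher`, `ratio()` and `find_longest_match()` is an assumption of mine: the helper names, the `method` keys, and the choice to normalize the positional and longest-block scores by the longer string's length.

```python
import difflib


def sequence_percent(a, b):
    """SequenceMatcher.ratio() scaled to 0..100."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return 100.0 * matcher.ratio()


def positional_percent(a, b):
    """Share of positions holding the same character (hypothetical metric)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / longest


def longest_block_percent(a, b):
    """Length of the longest common substring relative to the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return 100.0 * match.size / longest


# Hypothetical registry so the caller can pick the comparison by name.
METHODS = {
    'sequence': sequence_percent,
    'positional': positional_percent,
    'longest_block': longest_block_percent,
}


def similarity_percent(a, b, method='sequence'):
    """Return one percentage using the chosen comparison method."""
    return METHODS[method](a, b)


for name in METHODS:
    print(name, similarity_percent('abcde', 'zbcde', method=name))
```

For 'abcde' and 'zbcde' all three methods land on roughly 80 percent, but they diverge quickly once characters are inserted or shifted, which is why it can be useful to keep the method selectable.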

2010-03-10 17:03:57

I use double metaphone, and it works like a charm.

An example:

>>> dm(u'aubrey')
('APR', '')
>>> dm(u'richard')
('RXRT', 'RKRT')
>>> dm(u'katherine') == dm(u'catherine')
True

Update: Jellyfish also has it. Look under phonetic encoding.

2011-12-16 06:30:54

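The Levenshtein Python extension and C library.

https://github.com/ztane/python-Levenshtein/

The Levenshtein Python C extension module contains functions for fast computation of:
- Levenshtein (edit) distance, and edit operations
- string similarity
- approximate median strings, and generally string averaging
- string sequence and set similarity

It supports both normal and Unicode strings.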

$ pip install python-levenshtein
...
$ python
>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), because it's
    based on real minimal edit distance.

    Examples:
    >>> ratio('Hello world!', 'Holly grail!')
    0.58333333333333337
    >>> ratio('Brian', 'Jesus')
    0.0

>>> help(Levenshtein.distance)
distance(...)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance('Levenshtein', 'Lenvinsten')
    4
    >>> distance('Levenshtein', 'Levensthein')
    2
    >>> distance('Levenshtein', 'Levenshten')
    1
    >>> distance('Levenshtein', 'Levenshtein')
    0
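
To illustrate the help text's remark that an edit-distance-based ratio is usually equal to or somewhat higher than `difflib.SequenceMatcher.ratio()`, here is a small standard-library sketch. It rests on an assumption of mine that is not stated in the answer: that the ratio is derived from an edit distance in which a substitution counts as a deletion plus an insertion, which works out to `2 * LCS / (len(a) + len(b))`, where LCS is the length of the longest common subsequence. The names `lcs_length` and `indel_ratio` are hypothetical and are not part of the Levenshtein module.

```python
import difflib


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)  # extend a common subsequence
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_ratio(a, b):
    """Similarity in 0..1, assuming substitution = delete + insert."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 2.0 * lcs_length(a, b) / total


a, b = 'Hello world!', 'Holly grail!'
print(indel_ratio(a, b))  # about 0.5833, in line with the help text above
print(difflib.SequenceMatcher(None, a, b).ratio())
print(indel_ratio('Brian', 'Jesus'))  # 0.0
```

difflib builds its matches greedily from the longest common block outward, so the number of characters it matches can never exceed the longest common subsequence. Under the assumption above, that is one way to see why the edit-distance-based figure is never lower.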

2009-03-26 07:18:51

