如何在Python中获得一个字符串与另一个字符串相似的概率?
我想要得到一个十进制值,比如0.9(意思是90%)等等。最好是标准的Python和库。
e.g.
similar("Apple","Appel") #would have a high prob.
similar("Apple","Mango") #would have a lower prob.
如何在Python中获得一个字符串与另一个字符串相似的概率?
我想要得到一个十进制值,比如0.9(意思是90%)等等。最好是标准的Python和库。
e.g.
similar("Apple","Appel") #would have a high prob.
similar("Apple","Mango") #would have a lower prob.
当前回答
这是内置的。
from difflib import SequenceMatcher
def similar(a, b):
return SequenceMatcher(None, a, b).ratio()
使用它:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
其他回答
Python3.6 + = 没有导入图书馆 在大多数情况下工作良好
在堆栈溢出,当你试图添加一个标签或发布一个问题,它会带来所有相关的东西。这是如此方便,正是我正在寻找的算法。因此,我编写了一个查询集相似度过滤器。
def compare(qs, ip):
al = 2
v = 0
for ii, letter in enumerate(ip):
if letter == qs[ii]:
v += al
else:
ac = 0
for jj in range(al):
if ii - jj < 0 or ii + jj > len(qs) - 1:
break
elif letter == qs[ii - jj] or letter == qs[ii + jj]:
ac += jj
break
v += ac
return v
def getSimilarQuerySet(queryset, inp, length):
return [k for tt, (k, v) in enumerate(reversed(sorted({it: compare(it, inp) for it in queryset}.items(), key=lambda item: item[1])))][:length]
if __name__ == "__main__":
print(compare('apple', 'mongo'))
# 0
print(compare('apple', 'apple'))
# 10
print(compare('apple', 'appel'))
# 7
print(compare('dude', 'ud'))
# 1
print(compare('dude', 'du'))
# 4
print(compare('dude', 'dud'))
# 6
print(compare('apple', 'mongo'))
# 2
print(compare('apple', 'appel'))
# 8
print(getSimilarQuerySet(
[
"java",
"jquery",
"javascript",
"jude",
"aja",
],
"ja",
2,
))
# ['javascript', 'java']
解释
compare takes two string and returns a positive integer. you can edit the al allowed variable in compare, it indicates how large the range we need to search through. It works like this: two strings are iterated, if same character is find at same index, then accumulator will be added to a largest value. Then, we search in the index range of allowed, if matched, add to the accumulator based on how far the letter is. (the further, the smaller) length indicate how many items you want as result, that is most similar to input string.
包装距离包括Levenshtein距离:
import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3
你可以创建这样一个函数:
def similar(w1, w2):
w1 = w1 + ' ' * (len(w2) - len(w1))
w2 = w2 + ' ' * (len(w1) - len(w2))
return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))
TheFuzz是一个用python实现Levenshtein距离的包,在某些情况下,当你希望两个不同的字符串被认为是相同的时,它带有一些帮助函数来提供帮助。例如:
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
注意,difflib。SequenceMatcher只找到最长的连续匹配子序列,这通常不是我们想要的,例如:
>>> a1 = "Apple"
>>> a2 = "Appel"
>>> a1 *= 50
>>> a2 *= 50
>>> SequenceMatcher(None, a1, a2).ratio()
0.012 # very low
>>> SequenceMatcher(None, a1, a2).get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=250, b=250, size=0)] # only the first block is recorded
寻找两个字符串之间的相似性与生物信息学中成对序列比对的概念密切相关。有许多专门的库,包括生物马拉松。这个例子实现了Needleman Wunsch算法:
>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.score(a1, a2)
200.0
>>> aligner.algorithm
'Needleman-Wunsch'
使用biopython或其他生物信息学包比python标准库的任何部分都更灵活,因为有许多不同的评分方案和算法可用。此外,你可以得到匹配的序列来可视化正在发生的事情:
>>> alignment = next(aligner.align(a1, a2))
>>> alignment.score
200.0
>>> print(alignment)
Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-
|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-
App-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-el