如何在Python中获得一个字符串与另一个字符串相似的概率?
我想要得到一个十进制值,比如0.9(意思是90%)等等。最好是标准的Python和库。
e.g.
similar("Apple","Appel") #would have a high prob.
similar("Apple","Mango") #would have a lower prob.
如何在Python中获得一个字符串与另一个字符串相似的概率?
我想要得到一个十进制值,比如0.9(意思是90%)等等。最好是标准的Python和库。
e.g.
similar("Apple","Appel") #would have a high prob.
similar("Apple","Mango") #would have a lower prob.
当前回答
这是内置的。
from difflib import SequenceMatcher
def similar(a, b):
return SequenceMatcher(None, a, b).ratio()
使用它:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
其他回答
Python3.6 + = 没有导入图书馆 在大多数情况下工作良好
在堆栈溢出,当你试图添加一个标签或发布一个问题,它会带来所有相关的东西。这是如此方便,正是我正在寻找的算法。因此,我编写了一个查询集相似度过滤器。
def compare(qs, ip):
al = 2
v = 0
for ii, letter in enumerate(ip):
if letter == qs[ii]:
v += al
else:
ac = 0
for jj in range(al):
if ii - jj < 0 or ii + jj > len(qs) - 1:
break
elif letter == qs[ii - jj] or letter == qs[ii + jj]:
ac += jj
break
v += ac
return v
def getSimilarQuerySet(queryset, inp, length):
return [k for tt, (k, v) in enumerate(reversed(sorted({it: compare(it, inp) for it in queryset}.items(), key=lambda item: item[1])))][:length]
if __name__ == "__main__":
print(compare('apple', 'mongo'))
# 0
print(compare('apple', 'apple'))
# 10
print(compare('apple', 'appel'))
# 7
print(compare('dude', 'ud'))
# 1
print(compare('dude', 'du'))
# 4
print(compare('dude', 'dud'))
# 6
print(compare('apple', 'mongo'))
# 2
print(compare('apple', 'appel'))
# 8
print(getSimilarQuerySet(
[
"java",
"jquery",
"javascript",
"jude",
"aja",
],
"ja",
2,
))
# ['javascript', 'java']
解释
compare takes two string and returns a positive integer. you can edit the al allowed variable in compare, it indicates how large the range we need to search through. It works like this: two strings are iterated, if same character is find at same index, then accumulator will be added to a largest value. Then, we search in the index range of allowed, if matched, add to the accumulator based on how far the letter is. (the further, the smaller) length indicate how many items you want as result, that is most similar to input string.
TheFuzz是一个用python实现Levenshtein距离的包,在某些情况下,当你希望两个不同的字符串被认为是相同的时,它带有一些帮助函数来提供帮助。例如:
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
我想你们可能在寻找一种描述字符串之间距离的算法。这里有一些你可以参考的:
汉明距离 Levenshtein距离 Damerau-Levenshtein距离 Jaro-Winkler距离
这是我想到的:
import string
def match(a,b):
a,b = a.lower(), b.lower()
error = 0
for i in string.ascii_lowercase:
error += abs(a.count(i) - b.count(i))
total = len(a) + len(b)
return (total-error)/total
if __name__ == "__main__":
print(match("pple inc", "Apple Inc."))
BLEUscore
BLEU,即双语评估替补,是一个用于比较的分数 文本到一个或多个参考译文的候选翻译。 完全匹配的结果是1.0,而完全不匹配的结果是1.0 结果得分为0.0。 虽然它是为翻译而开发的,但也可以用来评估文本 为一套自然语言处理任务生成。
代码:
import nltk
from nltk.translate import bleu
from nltk.translate.bleu_score import SmoothingFunction
smoothie = SmoothingFunction().method4
C1='Text'
C2='Best'
print('BLEUscore:',bleu([C1], C2, smoothing_function=smoothie))
示例:通过更新C1和C2。
C1='Test' C2='Test'
BLEUscore: 1.0
C1='Test' C2='Best'
BLEUscore: 0.2326589746035907
C1='Test' C2='Text'
BLEUscore: 0.2866227639866161
你也可以比较句子的相似度:
C1='It is tough.' C2='It is rough.'
BLEUscore: 0.7348889200874658
C1='It is tough.' C2='It is tough.'
BLEUscore: 1.0