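How can I get the probability that one string is similar to another string in Python?
I want a decimal value such as 0.9 (meaning 90%). Preferably with standard Python and the standard library.
e.g.
similar("Apple","Appel") # would have a high probability
similar("Apple","Mango") # would have a lower probability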
Current answer
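This is what I came up with:
import string

def match(a, b):
    # Compare case-insensitively.
    a, b = a.lower(), b.lower()
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    # Sum the per-letter count differences (ASCII letters only).
    error = 0
    for i in string.ascii_lowercase:
        error += abs(a.count(i) - b.count(i))
    return (total - error) / total

if __name__ == "__main__":
    print(match("pple inc", "Apple Inc."))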
Other answers
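For my purposes I have my own quick_ratio(), which is about 2x faster than difflib SequenceMatcher's quick_ratio() while giving similar results. a and b are strings:
score = 0
for letter in a:
    # Add how many times this character of a occurs in b.
    score += b.count(letter)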
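The loop above produces a raw count rather than a value between 0 and 1, and repeated characters are counted more than once. As a rough sketch of how a count-based score could be normalized, the following uses a multiset intersection from the standard library. The name `count_ratio` and the treatment of two empty strings are assumptions made for illustration, not part of the answer above.

```python
from collections import Counter


def count_ratio(a: str, b: str) -> float:
    """Hypothetical helper: order-insensitive similarity in [0, 1]."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    # Multiset intersection: each character contributes
    # min(occurrences in a, occurrences in b).
    shared = sum((Counter(a) & Counter(b)).values())
    return 2.0 * shared / total
```

Because only character counts are compared, "Apple" and "Appel" score 1.0 here, so this is best treated as a cheap upper bound and not as a replacement for an order-aware comparison.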
There is a built-in for this:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
Using it:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
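As the "Apple"/"Mango" result suggests, SequenceMatcher is case-sensitive. The sketch below shows one possible way to compare while ignoring case, along with two other difflib features that may be useful: the cheaper upper-bound methods and `get_close_matches`. The helper name `similar_ignoring_case` and the candidate list are made up for illustration.

```python
from difflib import SequenceMatcher, get_close_matches


def similar_ignoring_case(a: str, b: str) -> float:
    """Hypothetical helper: SequenceMatcher ratio after case folding."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


matcher = SequenceMatcher(None, "Apple", "Appel")
# quick_ratio() and real_quick_ratio() are cheaper upper bounds on ratio().
bounds = (matcher.real_quick_ratio(), matcher.quick_ratio(), matcher.ratio())

# get_close_matches keeps candidates scoring at least `cutoff`, best first.
candidates = ["Appel", "Mango", "Ape"]
closest = get_close_matches("Apple", candidates, n=3, cutoff=0.6)
```

The upper bounds can be used to discard obviously dissimilar pairs before paying for the full `ratio()` call.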
The built-in SequenceMatcher is very slow on large inputs. Here is how to do it with diff-match-patch:
from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0  # 0 lets the diff run to completion
    diff = dmp.diff_main(text1, text2, False)
    # Similarity: characters in unchanged chunks (op == 0) over the longer length.
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    sim = common_text / text_length
    return sim, diff
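You can create a function like this:
def similar(w1, w2):
    # Pad the shorter word with spaces so both have the same length.
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    # Fraction of positions where the characters match.
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))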
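A position-by-position comparison like this is sensitive to shifts. The sketch below restates the idea with `itertools.zip_longest` to make that visible. The name `positional_similarity` is hypothetical, and unlike the function above it pads with `None` instead of spaces, so trailing spaces never count as matches.

```python
from itertools import zip_longest


def positional_similarity(w1: str, w2: str) -> float:
    """Hypothetical helper: share of aligned positions holding the same character."""
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    # zip_longest fills the shorter word with None, which never equals a character.
    matches = sum(x == y for x, y in zip_longest(w1, w2))
    return matches / longest


swapped = positional_similarity("Apple", "Appel")   # "App" lines up, "le"/"el" does not
shifted = positional_similarity("abcdef", "bcdef")  # one dropped letter shifts every position
```

Here `swapped` is 0.6 and `shifted` is 0.0, even though "abcdef" and "bcdef" differ by a single character. For input where insertions or deletions are likely, an alignment-based measure such as SequenceMatcher is probably a better fit.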