如何在Python中获得一个字符串与另一个字符串相似的概率?
我想要得到一个十进制值,比如0.9(意思是90%)等等。最好是标准的Python和库。
e.g.
similar("Apple","Appel") #would have a high prob.
similar("Apple","Mango") #would have a lower prob.
如何在Python中获得一个字符串与另一个字符串相似的概率?
我想要得到一个十进制值,比如0.9(意思是90%)等等。最好是标准的Python和库。
e.g.
similar("Apple","Appel") #would have a high prob.
similar("Apple","Mango") #would have a lower prob.
当前回答
还添加了Spacy NLP库;
@profile
def main():
str1= "Mar 31 09:08:41 The world is beautiful"
str2= "Mar 31 19:08:42 Beautiful is the world"
print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio())
print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))
if __name__ == '__main__':
#python3 -m spacy download en_core_web_sm
#nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("en_core_web_md")
main()
使用Robert Kern的line_profiler运行
kernprof -l -v ./python/loganalysis/testspacy.py
NLP Similarity= 0.9999999821467294
Diff lib similarity 0.5897435897435898
Jellyfish lib similarity 0.8561253561253562
然而,时间的启示
Function: main at line 32
Line # Hits Time Per Hit % Time Line Contents
==============================================================
32 @profile
33 def main():
34 1 1.0 1.0 0.0 str1= "Mar 31 09:08:41 The world is beautiful"
35 1 0.0 0.0 0.0 str2= "Mar 31 19:08:42 Beautiful is the world"
36 1 43248.0 43248.0 99.1 print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
37 1 375.0 375.0 0.9 print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio())
38 1 30.0 30.0 0.1 print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))
其他回答
出于我的目的,我有自己的quick_ratio(),它比difflib SequenceMatcher的quick_ratio()快2倍,同时提供类似的结果。A和b是字符串:
score = 0
for letters in enumerate(a):
score = score + b.count(letters[1])
我想你们可能在寻找一种描述字符串之间距离的算法。这里有一些你可以参考的:
汉明距离 Levenshtein距离 Damerau-Levenshtein距离 Jaro-Winkler距离
这是内置的。
from difflib import SequenceMatcher
def similar(a, b):
return SequenceMatcher(None, a, b).ratio()
使用它:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
包装距离包括Levenshtein距离:
import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3
这是我想到的:
import string
def match(a,b):
a,b = a.lower(), b.lower()
error = 0
for i in string.ascii_lowercase:
error += abs(a.count(i) - b.count(i))
total = len(a) + len(b)
return (total-error)/total
if __name__ == "__main__":
print(match("pple inc", "Apple Inc."))