我正在寻找一个NLP项目,在任何编程语言(尽管Python将是我的偏好)。
我想取两个文档并确定它们有多相似。
我正在寻找一个NLP项目,在任何编程语言(尽管Python将是我的偏好)。
我想取两个文档并确定它们有多相似。
当前回答
你可能想尝试一下cos文档相似度的在线服务http://www.scurtu.it/documentSimilarity.html
import urllib,urllib2
import json
API_URL="http://www.scurtu.it/apis/documentSimilarity"
inputDict={}
inputDict['doc1']='Document with some text'
inputDict['doc2']='Other document with some text'
params = urllib.urlencode(inputDict)
f = urllib2.urlopen(API_URL, params)
response= f.read()
responseObject=json.loads(response)
print responseObject
其他回答
这是一个老问题了,但我发现斯派西可以很容易地解决这个问题。读取文档后,可以使用简单的api相似性来查找文档向量之间的余弦相似性。
首先安装包并下载模型:
pip install spacy
python -m spacy download en_core_web_sm
然后用like so:
import spacy
nlp = spacy.load('en_core_web_sm')
doc1 = nlp(u'Hello hi there!')
doc2 = nlp(u'Hello hi there!')
doc3 = nlp(u'Hey whatsup?')
print (doc1.similarity(doc2)) # 0.999999954642
print (doc2.similarity(doc3)) # 0.699032527716
print (doc1.similarity(doc3)) # 0.699032527716
与@larsman相同,但有一些预处理
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('punkt') # if necessary...
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
def stem_tokens(tokens):
return [stemmer.stem(item) for item in tokens]
'''remove punctuation, lowercase, stem'''
def normalize(text):
return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
def cosine_sim(text1, text2):
tfidf = vectorizer.fit_transform([text1, text2])
return ((tfidf * tfidf.T).A)[0,1]
print cosine_sim('a little bird', 'a little bird')
print cosine_sim('a little bird', 'a little bird chirps')
print cosine_sim('a little bird', 'a big dog barks')
你可能想尝试一下cos文档相似度的在线服务http://www.scurtu.it/documentSimilarity.html
import urllib,urllib2
import json
API_URL="http://www.scurtu.it/apis/documentSimilarity"
inputDict={}
inputDict['doc1']='Document with some text'
inputDict['doc2']='Other document with some text'
params = urllib.urlencode(inputDict)
f = urllib2.urlopen(API_URL, params)
response= f.read()
responseObject=json.loads(response)
print responseObject
这里有一个小应用程序让你开始…
import difflib as dl
a = file('file').read()
b = file('file1').read()
sim = dl.get_close_matches
s = 0
wa = a.split()
wb = b.split()
for i in wa:
if sim(i, wb):
s += 1
n = float(s) / float(len(wa))
print '%d%% similarity' % int(n * 100)
Generally a cosine similarity between two documents is used as a similarity measure of documents. In Java, you can use Lucene (if your collection is pretty large) or LingPipe to do this. The basic concept would be to count the terms in every document and calculate the dot product of the term vectors. The libraries do provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something copmlex, LingPipe also provides methods to calculate LSA similarity between documents which gives better results than cosine similarity. For Python, you can use NLTK.