I am looking to work on an NLP project, in any programming language (though Python would be my preference).

I want to take two documents and determine how similar they are.


Current answer

Here is a small application to get you started…

import difflib as dl

# Read both documents (file() is Python 2 only; use open() instead)
a = open('file').read()
b = open('file1').read()

sim = dl.get_close_matches

s = 0
wa = a.split()
wb = b.split()

# Count how many words in the first document have a close match in the second
for i in wa:
    if sim(i, wb):
        s += 1

n = s / len(wa)
print('%d%% similarity' % int(n * 100))
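
If a single score for the two whole documents is enough, difflib's SequenceMatcher can compare them directly; a minimal sketch, reusing the same file names as above:

import difflib

a = open('file').read()
b = open('file1').read()

# ratio() returns a similarity score between 0 and 1 for the two whole texts
ratio = difflib.SequenceMatcher(None, a, b).ratio()
print('%d%% similarity' % int(ratio * 100))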

Other answers


I am combining the solutions from the answers of @FredFoo and @Renaud. My solution applies @Renaud's preprocessing to @FredFoo's text corpus and then displays the pairwise similarities that are greater than 0. I ran this code on Windows after installing Python and pip. pip is installed as part of Python, but you may have to enable it explicitly by re-running the installer, choosing Modify, and then selecting pip. I use the command line to execute my Python code, saved in a file named "similarity.py". I had to execute the following commands:

>set PYTHONPATH=%PYTHONPATH%;C:\_location_of_python_lib_
>python -m pip install scikit-learn
>python -m pip install nltk
>py similarity.py

The code for similarity.py is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk, string
import numpy as np
nltk.download('punkt') # if necessary...

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

corpus = ["I'd like an apple", 
           "An apple a day keeps the doctor away", 
           "Never compare an apple to an orange", 
           "I prefer scikit-learn to Orange", 
           "The scikit-learn docs are Orange and Blue"]  

vect = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf = vect.fit_transform(corpus)

pairwise_similarity = tfidf * tfidf.T

#view the pairwise similarities 
print(pairwise_similarity)

#check how a string is normalized
print(normalize("The scikit-learn docs are Orange and Blue"))
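
The print(pairwise_similarity) call above displays the raw sparse matrix. To list only the document pairs whose similarity is greater than 0, as described in the text, one possible follow-up sketch (reusing the np import already in the script) is:

# Convert the sparse similarity matrix to a dense NumPy array
arr = pairwise_similarity.toarray()

# Print every distinct pair of documents with a non-zero similarity
for i in range(len(corpus)):
    for j in range(i + 1, len(corpus)):
        if arr[i, j] > 0:
            print('%.3f  "%s" <-> "%s"' % (arr[i, j], corpus[i], corpus[j]))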

We can use Sentence Transformers to accomplish this task (link).

Here is a simple example from SBERT:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
# Two lists of sentences
sentences1 = ['The cat sits outside']
sentences2 = ['The dog plays in the garden']
#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
#Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)
#Output the pairs with their score
for i in range(len(sentences1)):
   print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], 
         sentences2[i], cosine_scores[i][i]))
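
Since the question is about whole documents rather than single sentences, the same model can embed each document as one string; a rough sketch (note that the model truncates inputs longer than its maximum sequence length, so very long documents may need to be split and their embeddings averaged):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Each document is passed as one string; 'file' and 'file1' follow the earlier example
doc_a = open('file').read()
doc_b = open('file1').read()

embeddings = model.encode([doc_a, doc_b], convert_to_tensor=True)
print('Document similarity: {:.4f}'.format(util.cos_sim(embeddings[0], embeddings[1]).item()))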

You might want to try this online service for cosine document similarity: http://www.scurtu.it/documentSimilarity.html

import urllib.parse
import urllib.request
import json

API_URL = "http://www.scurtu.it/apis/documentSimilarity"
inputDict = {
    'doc1': 'Document with some text',
    'doc2': 'Other document with some text',
}
# POST the two documents as form-encoded data
params = urllib.parse.urlencode(inputDict).encode('utf-8')
response = urllib.request.urlopen(API_URL, params).read()
responseObject = json.loads(response)
print(responseObject)
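
If the requests library happens to be available, the same POST can be written more compactly (a sketch, not part of the original answer):

import requests

# Same call to the documentSimilarity API using requests
resp = requests.post('http://www.scurtu.it/apis/documentSimilarity',
                     data={'doc1': 'Document with some text',
                           'doc2': 'Other document with some text'})
print(resp.json())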

This is an old question, but I found it can be solved easily with spaCy. Once the documents are parsed, you can use the simple similarity API to find the cosine similarity between the document vectors.

First install the package and download the model:

pip install spacy
python -m spacy download en_core_web_sm

Then use it like so:

import spacy

nlp = spacy.load('en_core_web_sm')
doc1 = nlp('Hello hi there!')
doc2 = nlp('Hello hi there!')
doc3 = nlp('Hey whatsup?')

print(doc1.similarity(doc2))  # 0.999999954642
print(doc2.similarity(doc3))  # 0.699032527716
print(doc1.similarity(doc3))  # 0.699032527716
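
One caveat: the small en_core_web_sm model ships without word vectors, so spaCy falls back to context-sensitive tensors for similarity and emits a warning; the medium or large models give more meaningful scores. A minimal variation, assuming the md model has been downloaded with python -m spacy download en_core_web_md:

import spacy

# en_core_web_md includes word vectors, so Doc.similarity is more reliable
nlp = spacy.load('en_core_web_md')

doc1 = nlp('Hello hi there!')
doc3 = nlp('Hey whatsup?')
print(doc1.similarity(doc3))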