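How to compute the similarity between two text documents?

I am looking at working on an NLP project, in any programming language (though Python will be my preference).

I want to take two documents and determine how similar they are.

Current answer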

Syntactic similarity: there are 3 easy ways of detecting similarity.

Word2Vec; GloVe; Tfidf or CountVectorizer
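
As a minimal sketch of the count-vector option, the snippet below builds bag-of-words counts with the standard library and compares them with cosine similarity. The whitespace tokenizer and the function name `cosine_similarity_counts` are illustrative assumptions, not part of any library; scikit-learn's CountVectorizer tokenizes differently.

```python
import math
from collections import Counter


def cosine_similarity_counts(doc_a, doc_b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())

    # Only terms present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity_counts("an apple a day", "compare an apple to an orange"))
```

A score of 1.0 means the two count vectors point in the same direction, and 0.0 means the texts share no terms.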

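Semantic similarity: you can use BERT embeddings and try different word pooling strategies to get a document embedding, then apply cosine similarity on the document embeddings.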
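
The pooling step can be sketched without a model: given one vector per token, mean pooling averages them into a single document vector, and cosine similarity compares two such vectors. The tiny 3-dimensional vectors below are made-up stand-ins for real BERT token embeddings, and `mean_pool` and `cosine` are hypothetical helper names.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one document vector (mean pooling)."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Assumption: toy vectors; a real pipeline would take these from a BERT model.
doc1_tokens = [[0.2, 0.1, 0.7], [0.4, 0.3, 0.5]]
doc2_tokens = [[0.3, 0.2, 0.6], [0.1, 0.0, 0.9], [0.5, 0.4, 0.4]]

print(cosine(mean_pool(doc1_tokens), mean_pool(doc2_tokens)))
```

Other pooling strategies (for example taking the element-wise maximum instead of the mean) would only change `mean_pool`; the cosine step stays the same.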

An advanced approach is to use BERTScore to obtain the similarity.

Research paper link: https://arxiv.org/abs/1904.09675
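
As I read the linked paper, BERTScore matches each token in one text to its most similar token in the other (by cosine similarity of contextual embeddings) and averages those maxima into precision, recall, and F1. The sketch below shows only that greedy matching step on made-up vectors; it leaves out the optional idf weighting and baseline rescaling, and the function names are hypothetical rather than the API of any BERTScore package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_score(candidate, reference):
    """Return (precision, recall, f1) from greedy token-to-token matching."""
    # Precision: each candidate token is matched to its best reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its best candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Assumption: toy 2-d vectors in place of contextual BERT embeddings.
candidate = [[1.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]
print(greedy_match_score(candidate, reference))
```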

2019-11-14 10:28:10

Other answers

I am combining the solutions from the answers of @FredFoo and @Renaud. My solution applies @Renaud's preprocessing to @FredFoo's text corpus and then displays the pairwise similarities that are greater than 0. I ran this code on Windows after installing Python and pip. pip is installed as part of Python, but you may have to add it explicitly by re-running the installer, choosing Modify, and then selecting pip. I use the command line to run my Python code, saved in a file named "similarity.py". I had to execute the following commands:

>set PYTHONPATH=%PYTHONPATH%;C:\_location_of_python_lib_
>python -m pip install scikit-learn
>python -m pip install nltk
>py similarity.py

The code for similarity.py is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk, string
import numpy as np
nltk.download('punkt') # if necessary; newer NLTK releases may need 'punkt_tab' instead

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

corpus = ["I'd like an apple", 
           "An apple a day keeps the doctor away", 
           "Never compare an apple to an orange", 
           "I prefer scikit-learn to Orange", 
           "The scikit-learn docs are Orange and Blue"]  

vect = TfidfVectorizer(tokenizer=normalize, stop_words='english')
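tfidf = vect.fit_transform(corpus)

# TfidfVectorizer L2-normalizes each row by default, so this product is the cosine similarity matrix
pairwise_similarity = tfidf * tfidf.T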

#view the pairwise similarities 
print(pairwise_similarity)

#check how a string is normalized
print(normalize("The scikit-learn docs are Orange and Blue"))
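
The product `tfidf * tfidf.T` yields cosine similarities because TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows equals their cosine. The standard-library sketch below illustrates that on whitespace-tokenized text, using the smoothed idf formula ln((1 + n) / (1 + df)) + 1 that scikit-learn documents as its default. `tfidf_rows` and `pairwise` are hypothetical helpers, and the numbers will not match the script above because no stemming or stop-word removal is done here.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


def pairwise(rows):
    """Dot product of every row pair; equals cosine since rows have unit length."""
    return [[sum(w * b.get(t, 0.0) for t, w in a.items()) for b in rows] for a in rows]


docs = ["an apple a day", "compare an apple to an orange", "scikit-learn docs"]
sim = pairwise(tfidf_rows(docs))
for row in sim:
    print([round(x, 3) for x in row])
```

The diagonal is 1 (every document is identical to itself), and documents with no shared terms score 0.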

2021-01-21 10:58:36

You might want to try this online service for cosine document similarity: http://www.scurtu.it/documentSimilarity.html

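# Python 3 port of the original Python 2 snippet (urllib2 no longer exists)
import json
import urllib.parse
import urllib.request

API_URL = "http://www.scurtu.it/apis/documentSimilarity"
input_dict = {
    'doc1': 'Document with some text',
    'doc2': 'Other document with some text',
}
# Passing a bytes body makes urlopen send a POST request
params = urllib.parse.urlencode(input_dict).encode('utf-8')
with urllib.request.urlopen(API_URL, params) as f:
    response = f.read()
response_object = json.loads(response)
print(response_object)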

2013-02-12 11:49:29

To find sentence similarity with a smaller dataset and still get higher accuracy, you can use the Python package below, which uses pre-trained BERT models:

pip install similar-sentences

2020-04-16 12:14:54

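If you are more interested in measuring the semantic similarity of two pieces of text, I suggest taking a look at this GitLab project. You can run it as a server, and there is also a pre-built model that you can easily use to measure the similarity of two pieces of text; even though it is mostly intended for measuring the similarity of two sentences, you can still use it in your case. It is written in Java, but you can run it as a RESTful service.

Another option is DKPro Similarity, a library with various algorithms for measuring the similarity of texts. However, it is also written in Java.

Code example: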

// this similarity measure is defined in the dkpro.similarity.algorithms.lexical-asl package
// you need to add that to your .pom to make that example work
// there are some examples that should work out of the box in dkpro.similarity.example-gpl 
TextSimilarityMeasure measure = new WordNGramJaccardMeasure(3);    // Use word trigrams

String[] tokens1 = "This is a short example text .".split(" ");   
String[] tokens2 = "A short example text could look like that .".split(" ");

double score = measure.getSimilarity(tokens1, tokens2);

System.out.println("Similarity: " + score);
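
Since the question prefers Python, the idea behind a word n-gram Jaccard measure can be sketched in a few lines: collect the set of word trigrams of each text and divide the size of the intersection by the size of the union. This is a plain reimplementation of the general technique with hypothetical function names, not a port of DKPro's class, which may normalize tokens differently and so return a different score.

```python
def word_ngrams(tokens, n):
    """Set of all contiguous n-token tuples in the token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(tokens1, tokens2, n=3):
    """Jaccard similarity |A & B| / |A | B| over word n-gram sets."""
    a = word_ngrams(tokens1, n)
    b = word_ngrams(tokens2, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


tokens1 = "This is a short example text .".split(" ")
tokens2 = "A short example text could look like that .".split(" ")

print("Similarity:", ngram_jaccard(tokens1, tokens2))
```

Because the comparison here is case-sensitive, "a short example" and "A short example" count as different trigrams; lowercasing the tokens first would raise the score.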

2018-01-31 23:35:53
