如何计算两个文本文档之间的相似度?

我正在寻找一个NLP项目，在任何编程语言(尽管Python将是我的偏好)。

我想取两个文档并确定它们有多相似。

当前回答

如果您对测量两段文本的语义相似性更感兴趣，我建议您看看这个gitlab项目。你可以把它作为服务器运行，也有一个预先构建的模型，你可以很容易地使用它来测量两段文本的相似性;尽管它主要用于测量两个句子的相似度，但你仍然可以在你的情况下使用它。它是用java编写的，但您可以将其作为RESTful服务运行。

另一个选择是DKPro Similarity，这是一个库，有各种算法来测量文本的相似性。然而，它也是用java编写的。

代码示例:

// this similarity measure is defined in the dkpro.similarity.algorithms.lexical-asl package
// you need to add that to your .pom to make that example work
// there are some examples that should work out of the box in dkpro.similarity.example-gpl 
TextSimilarityMeasure measure = new WordNGramJaccardMeasure(3);    // Use word trigrams

String[] tokens1 = "This is a short example text .".split(" ");   
String[] tokens2 = "A short example text could look like that .".split(" ");

double score = measure.getSimilarity(tokens1, tokens2);

System.out.println("Similarity: " + score);

2018-01-31 23:35:53

其他回答

I am combining the solutions from answers of @FredFoo and @Renaud. My solution is able to apply @Renaud's preprocessing on the text corpus of @FredFoo and then display pairwise similarities where the similarity is greater than 0. I ran this code on Windows by installing python and pip first. pip is installed as part of python but you may have to explicitly do it by re-running the installation package, choosing modify and then choosing pip. I use the command line to execute my python code saved in a file "similarity.py". I had to execute the following commands:

>set PYTHONPATH=%PYTHONPATH%;C:\_location_of_python_lib_
>python -m pip install sklearn
>python -m pip install nltk
>py similarity.py

similar .py的代码如下:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk, string
import numpy as np
nltk.download('punkt') # if necessary...

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

corpus = ["I'd like an apple", 
           "An apple a day keeps the doctor away", 
           "Never compare an apple to an orange", 
           "I prefer scikit-learn to Orange", 
           "The scikit-learn docs are Orange and Blue"]  

vect = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf = vect.fit_transform(corpus)   
                                                                                                                                                                                                                    
pairwise_similarity = tfidf * tfidf.T

#view the pairwise similarities 
print(pairwise_similarity)

#check how a string is normalized
print(normalize("The scikit-learn docs are Orange and Blue"))

2021-01-21 10:58:36

常见的方法是将文档转换为TF-IDF向量，然后计算它们之间的余弦相似度。任何关于信息检索(IR)的教科书都涵盖了这一点。参见《信息检索导论》，该书可在网上免费获得。

两两计算相似度

TF-IDF(以及类似的文本转换)在Python包Gensim和scikit-learn中实现。在后一个包中，计算余弦相似度非常简单

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

或者，如果文档是普通字符串，

>>> corpus = ["I'd like an apple", 
...           "An apple a day keeps the doctor away", 
...           "Never compare an apple to an orange", 
...           "I prefer scikit-learn to Orange", 
...           "The scikit-learn docs are Orange and Blue"]                                                                                                                                                                                                   
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")                                                                                                                                                                                                   
>>> tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
>>> pairwise_similarity = tfidf * tfidf.T

尽管Gensim在这类任务中可能有更多选择。

再看看这个问题。

[免责声明:我参与了scikit-learn TF-IDF的实现。]

解读结果

从上面来看，pairwise_similarity是一个方形的Scipy稀疏矩阵，行数和列数等于语料库中文档的数量。

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

你可以通过.toarray()或.A将稀疏数组转换为NumPy数组:

>>> pairwise_similarity.toarray()                                                                                                                                                                                                                            
array([[1.        , 0.17668795, 0.27056873, 0.        , 0.        ],
       [0.17668795, 1.        , 0.15439436, 0.        , 0.        ],
       [0.27056873, 0.15439436, 1.        , 0.19635649, 0.16815247],
       [0.        , 0.        , 0.19635649, 1.        , 0.54499756],
       [0.        , 0.        , 0.16815247, 0.54499756, 1.        ]])

假设我们想要找到与最终文档最相似的文档，“the scikit-learn docs are Orange and Blue”。本文语料库索引为4。您可以通过取该行的argmax来找到最相似文档的索引，但首先需要屏蔽1,1表示每个文档与其自身的相似性。你可以通过np.fill_diagonal()来实现后者，通过np.nanargmax()来实现前者:

>>> import numpy as np     
                                                                                                                                                                                                                                  
>>> arr = pairwise_similarity.toarray()     
>>> np.fill_diagonal(arr, np.nan)                                                                                                                                                                                                                            
                                                                                                                                                                                                                 
>>> input_doc = "The scikit-learn docs are Orange and Blue"                                                                                                                                                                                                  
>>> input_idx = corpus.index(input_doc)                                                                                                                                                                                                                      
>>> input_idx                                                                                                                                                                                                                                                
4

>>> result_idx = np.nanargmax(arr[input_idx])                                                                                                                                                                                                                
>>> corpus[result_idx]                                                                                                                                                                                                                                       
'I prefer scikit-learn to Orange'

注意:使用稀疏矩阵的目的是为大型语料库和词汇表节省(大量空间)。你可以这样做，而不是转换为NumPy数组:

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

2012-01-17 15:54:00

这里有一个小应用程序让你开始…

import difflib as dl

a = file('file').read()
b = file('file1').read()

sim = dl.get_close_matches

s = 0
wa = a.split()
wb = b.split()

for i in wa:
    if sim(i, wb):
        s += 1

n = float(s) / float(len(wa))
print '%d%% similarity' % int(n * 100)

2012-06-30 21:08:03

Generally a cosine similarity between two documents is used as a similarity measure of documents. In Java, you can use Lucene (if your collection is pretty large) or LingPipe to do this. The basic concept would be to count the terms in every document and calculate the dot product of the term vectors. The libraries do provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something copmlex, LingPipe also provides methods to calculate LSA similarity between documents which gives better results than cosine similarity. For Python, you can use NLTK.

2012-01-17 15:59:17

这里是Simphile NLP文本相似性Python包的创建者。Simphile包含几种文本相似度方法，它们与语言无关，并且比语言嵌入占用的cpu更少。

安装:

pip install simphile

选择你最喜欢的方法。这个例子显示了三点:

from simphile import jaccard_similarity, euclidian_similarity, compression_similarity

text_a = "I love dogs"
text_b = "I love cats"

print(f"Jaccard Similarity: {jaccard_similarity(text_a, text_b)}")
print(f"Euclidian Similarity: {euclidian_similarity(text_a, text_b)}")
print(f"Compression Similarity: {compression_similarity(text_a, text_b)}")

压缩相似性——利用压缩算法的模式识别欧几里得相似性-把文本当作多维空间中的点，并计算它们的接近度 Jaccard Similairy -文字重叠越多，文本越相似

2022-09-30 11:48:00

如何计算两个文本文档之间的相似度?

推荐文章

最新文章

标签