我正在寻找一个NLP项目,在任何编程语言(尽管Python将是我的偏好)。
我想取两个文档并确定它们有多相似。
我正在寻找一个NLP项目,在任何编程语言(尽管Python将是我的偏好)。
我想取两个文档并确定它们有多相似。
当前回答
如果您对测量两段文本的语义相似性更感兴趣,我建议您看看这个gitlab项目。你可以把它作为服务器运行,也有一个预先构建的模型,你可以很容易地使用它来测量两段文本的相似性;尽管它主要用于测量两个句子的相似度,但你仍然可以在你的情况下使用它。它是用java编写的,但您可以将其作为RESTful服务运行。
另一个选择是DKPro Similarity,这是一个库,有各种算法来测量文本的相似性。然而,它也是用java编写的。
代码示例:
// this similarity measure is defined in the dkpro.similarity.algorithms.lexical-asl package
// you need to add that to your .pom to make that example work
// there are some examples that should work out of the box in dkpro.similarity.example-gpl
TextSimilarityMeasure measure = new WordNGramJaccardMeasure(3); // Use word trigrams
String[] tokens1 = "This is a short example text .".split(" ");
String[] tokens2 = "A short example text could look like that .".split(" ");
double score = measure.getSimilarity(tokens1, tokens2);
System.out.println("Similarity: " + score);
其他回答
为了用更少的数据集找到句子的相似性,并获得更高的精度,你可以使用下面的python包,它使用预训练的BERT模型,
pip install similar-sentences
如果你正在寻找一些非常精确的东西,你需要使用一些比tf-idf更好的工具。通用句子编码器是最准确的找到任何两段文本之间的相似性的编码器之一。谷歌提供了预训练的模型,您可以将其用于自己的应用程序,而不需要从头开始训练任何东西。首先,你必须安装tensorflow和tensorflow-hub:
pip install tensorflow
pip install tensorflow_hub
下面的代码允许您将任何文本转换为固定长度的向量表示,然后您可以使用点积来找出它们之间的相似性
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/1?tf-hub-format=compressed"
# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)
# sample text
messages = [
# Smartphones
"My phone is not good.",
"Your cellphone looks great.",
# Weather
"Will it snow tomorrow?",
"Recently a lot of hurricanes have hit the US",
# Food and health
"An apple a day, keeps the doctors away",
"Eating strawberries is healthy",
]
similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)
with tf.Session() as session:
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
message_embeddings_ = session.run(similarity_message_encodings, feed_dict={similarity_input_placeholder: messages})
corr = np.inner(message_embeddings_, message_embeddings_)
print(corr)
heatmap(messages, messages, corr)
绘图的代码:
def heatmap(x_labels, y_labels, values):
fig, ax = plt.subplots()
im = ax.imshow(values)
# We want to show all ticks...
ax.set_xticks(np.arange(len(x_labels)))
ax.set_yticks(np.arange(len(y_labels)))
# ... and label them with the respective list entries
ax.set_xticklabels(x_labels)
ax.set_yticklabels(y_labels)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", fontsize=10,
rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(y_labels)):
for j in range(len(x_labels)):
text = ax.text(j, i, "%.2f"%values[i, j],
ha="center", va="center", color="w",
fontsize=6)
fig.tight_layout()
plt.show()
结果将是:
正如你所看到的,最相似的是文本本身和意义相近的文本之间。
重要的是:第一次运行代码会很慢,因为它需要下载模型。如果你想防止它再次下载模型并使用本地模型,你必须为缓存创建一个文件夹,并将其添加到环境变量中,然后在第一次运行后使用该路径:
tf_hub_cache_dir = "universal_encoder_cached/"
os.environ["TFHUB_CACHE_DIR"] = tf_hub_cache_dir
# pointing to the folder inside cache dir, it will be unique on your system
module_url = tf_hub_cache_dir+"/d8fbeb5c580e50f975ef73e80bebba9654228449/"
embed = hub.Module(module_url)
更多信息:https://tfhub.dev/google/universal-sentence-encoder/2
I am combining the solutions from answers of @FredFoo and @Renaud. My solution is able to apply @Renaud's preprocessing on the text corpus of @FredFoo and then display pairwise similarities where the similarity is greater than 0. I ran this code on Windows by installing python and pip first. pip is installed as part of python but you may have to explicitly do it by re-running the installation package, choosing modify and then choosing pip. I use the command line to execute my python code saved in a file "similarity.py". I had to execute the following commands:
>set PYTHONPATH=%PYTHONPATH%;C:\_location_of_python_lib_
>python -m pip install sklearn
>python -m pip install nltk
>py similarity.py
similar .py的代码如下:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk, string
import numpy as np
nltk.download('punkt') # if necessary...
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
def stem_tokens(tokens):
return [stemmer.stem(item) for item in tokens]
def normalize(text):
return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
corpus = ["I'd like an apple",
"An apple a day keeps the doctor away",
"Never compare an apple to an orange",
"I prefer scikit-learn to Orange",
"The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T
#view the pairwise similarities
print(pairwise_similarity)
#check how a string is normalized
print(normalize("The scikit-learn docs are Orange and Blue"))
Generally a cosine similarity between two documents is used as a similarity measure of documents. In Java, you can use Lucene (if your collection is pretty large) or LingPipe to do this. The basic concept would be to count the terms in every document and calculate the dot product of the term vectors. The libraries do provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something copmlex, LingPipe also provides methods to calculate LSA similarity between documents which gives better results than cosine similarity. For Python, you can use NLTK.
你可能想尝试一下cos文档相似度的在线服务http://www.scurtu.it/documentSimilarity.html
import urllib,urllib2
import json
API_URL="http://www.scurtu.it/apis/documentSimilarity"
inputDict={}
inputDict['doc1']='Document with some text'
inputDict['doc2']='Other document with some text'
params = urllib.urlencode(inputDict)
f = urllib2.urlopen(API_URL, params)
response= f.read()
responseObject=json.loads(response)
print responseObject