我正在寻找一个NLP项目,在任何编程语言(尽管Python将是我的偏好)。
我想取两个文档并确定它们有多相似。
我正在寻找一个NLP项目,在任何编程语言(尽管Python将是我的偏好)。
我想取两个文档并确定它们有多相似。
当前回答
如果你正在寻找一些非常精确的东西,你需要使用一些比tf-idf更好的工具。通用句子编码器是最准确的找到任何两段文本之间的相似性的编码器之一。谷歌提供了预训练的模型,您可以将其用于自己的应用程序,而不需要从头开始训练任何东西。首先,你必须安装tensorflow和tensorflow-hub:
pip install tensorflow
pip install tensorflow_hub
下面的代码允许您将任何文本转换为固定长度的向量表示,然后您可以使用点积来找出它们之间的相似性
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/1?tf-hub-format=compressed"
# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)
# sample text
messages = [
# Smartphones
"My phone is not good.",
"Your cellphone looks great.",
# Weather
"Will it snow tomorrow?",
"Recently a lot of hurricanes have hit the US",
# Food and health
"An apple a day, keeps the doctors away",
"Eating strawberries is healthy",
]
similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)
with tf.Session() as session:
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
message_embeddings_ = session.run(similarity_message_encodings, feed_dict={similarity_input_placeholder: messages})
corr = np.inner(message_embeddings_, message_embeddings_)
print(corr)
heatmap(messages, messages, corr)
绘图的代码:
def heatmap(x_labels, y_labels, values):
fig, ax = plt.subplots()
im = ax.imshow(values)
# We want to show all ticks...
ax.set_xticks(np.arange(len(x_labels)))
ax.set_yticks(np.arange(len(y_labels)))
# ... and label them with the respective list entries
ax.set_xticklabels(x_labels)
ax.set_yticklabels(y_labels)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", fontsize=10,
rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(y_labels)):
for j in range(len(x_labels)):
text = ax.text(j, i, "%.2f"%values[i, j],
ha="center", va="center", color="w",
fontsize=6)
fig.tight_layout()
plt.show()
结果将是:
正如你所看到的,最相似的是文本本身和意义相近的文本之间。
重要的是:第一次运行代码会很慢,因为它需要下载模型。如果你想防止它再次下载模型并使用本地模型,你必须为缓存创建一个文件夹,并将其添加到环境变量中,然后在第一次运行后使用该路径:
tf_hub_cache_dir = "universal_encoder_cached/"
os.environ["TFHUB_CACHE_DIR"] = tf_hub_cache_dir
# pointing to the folder inside cache dir, it will be unique on your system
module_url = tf_hub_cache_dir+"/d8fbeb5c580e50f975ef73e80bebba9654228449/"
embed = hub.Module(module_url)
更多信息:https://tfhub.dev/google/universal-sentence-encoder/2
其他回答
为了用更少的数据集找到句子的相似性,并获得更高的精度,你可以使用下面的python包,它使用预训练的BERT模型,
pip install similar-sentences
这里是Simphile NLP文本相似性Python包的创建者。Simphile包含几种文本相似度方法,它们与语言无关,并且比语言嵌入占用的cpu更少。
安装:
pip install simphile
选择你最喜欢的方法。这个例子显示了三点:
from simphile import jaccard_similarity, euclidian_similarity, compression_similarity
text_a = "I love dogs"
text_b = "I love cats"
print(f"Jaccard Similarity: {jaccard_similarity(text_a, text_b)}")
print(f"Euclidian Similarity: {euclidian_similarity(text_a, text_b)}")
print(f"Compression Similarity: {compression_similarity(text_a, text_b)}")
压缩相似性——利用压缩算法的模式识别 欧几里得相似性-把文本当作多维空间中的点,并计算它们的接近度 Jaccard Similairy -文字重叠越多,文本越相似
你可能想尝试一下cos文档相似度的在线服务http://www.scurtu.it/documentSimilarity.html
import urllib,urllib2
import json
API_URL="http://www.scurtu.it/apis/documentSimilarity"
inputDict={}
inputDict['doc1']='Document with some text'
inputDict['doc2']='Other document with some text'
params = urllib.urlencode(inputDict)
f = urllib2.urlopen(API_URL, params)
response= f.read()
responseObject=json.loads(response)
print responseObject
我们可以使用句子转换来完成这个任务 链接
下面是一个来自sbert的简单示例:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
# Two lists of sentences
sentences1 = ['The cat sits outside']
sentences2 = ['The dog plays in the garden']
#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
#Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)
#Output the pairs with their score
for i in range(len(sentences1)):
print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
sentences2[i], cosine_scores[i][i]))
I am combining the solutions from answers of @FredFoo and @Renaud. My solution is able to apply @Renaud's preprocessing on the text corpus of @FredFoo and then display pairwise similarities where the similarity is greater than 0. I ran this code on Windows by installing python and pip first. pip is installed as part of python but you may have to explicitly do it by re-running the installation package, choosing modify and then choosing pip. I use the command line to execute my python code saved in a file "similarity.py". I had to execute the following commands:
>set PYTHONPATH=%PYTHONPATH%;C:\_location_of_python_lib_
>python -m pip install sklearn
>python -m pip install nltk
>py similarity.py
similar .py的代码如下:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk, string
import numpy as np
nltk.download('punkt') # if necessary...
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
def stem_tokens(tokens):
return [stemmer.stem(item) for item in tokens]
def normalize(text):
return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
corpus = ["I'd like an apple",
"An apple a day keeps the doctor away",
"Never compare an apple to an orange",
"I prefer scikit-learn to Orange",
"The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T
#view the pairwise similarities
print(pairwise_similarity)
#check how a string is normalized
print(normalize("The scikit-learn docs are Orange and Blue"))