维基百科上的余弦相似度文章
你能在这里(以列表或其他形式)显示向量吗? 然后算一算,看看是怎么回事?
维基百科上的余弦相似度文章
你能在这里(以列表或其他形式)显示向量吗? 然后算一算,看看是怎么回事?
当前回答
为了简单起见,我化简了向量a和b:
Let :
a : [1, 1, 0]
b : [1, 0, 1]
那么余弦相似度(Theta)
(Theta) = (1*1 + 1*0 + 0*1)/sqrt((1^2 + 1^2))* sqrt((1^2 + 1^2)) = 1/2 = 0.5
cos0.5的逆是60度。
其他回答
两个向量A和B存在于二维空间或三维空间中,它们之间的夹角为cos相似度。
如果角度更大(可以达到最大180度),即Cos 180=-1,最小角度为0度。cos0 =1意味着向量是对齐的,因此向量是相似的。
cos 90=0(这足以得出向量A和B根本不相似,因为距离不能为负,余弦值将在0到1之间。因此,更多的角度意味着降低相似性(视觉化也有意义)
简单的JAVA代码计算余弦相似度
/**
* Method to calculate cosine similarity of vectors
* 1 - exactly similar (angle between them is 0)
* 0 - orthogonal vectors (angle between them is 90)
* @param vector1 - vector in the form [a1, a2, a3, ..... an]
* @param vector2 - vector in the form [b1, b2, b3, ..... bn]
* @return - the cosine similarity of vectors (ranges from 0 to 1)
*/
private double cosineSimilarity(List<Double> vector1, List<Double> vector2) {
double dotProduct = 0.0;
double normA = 0.0;
double normB = 0.0;
for (int i = 0; i < vector1.size(); i++) {
dotProduct += vector1.get(i) * vector2.get(i);
normA += Math.pow(vector1.get(i), 2);
normB += Math.pow(vector2.get(i), 2);
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
这是一个简单的Python代码,实现余弦相似度。
from scipy import linalg, mat, dot
import numpy as np
In [12]: matrix = mat( [[2, 1, 0, 2, 0, 1, 1, 1],[2, 1, 1, 1, 1, 0, 1, 1]] )
In [13]: matrix
Out[13]:
matrix([[2, 1, 0, 2, 0, 1, 1, 1],
[2, 1, 1, 1, 1, 0, 1, 1]])
In [14]: dot(matrix[0],matrix[1].T)/np.linalg.norm(matrix[0])/np.linalg.norm(matrix[1])
Out[14]: matrix([[ 0.82158384]])
以@Bill Bell为例,在[R]中有两种方法
a = c(2,1,0,2,0,1,1,1)
b = c(2,1,1,1,1,0,1,1)
d = (a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
或者利用crossprod()方法的性能…
e = crossprod(a, b) / (sqrt(crossprod(a, a)) * sqrt(crossprod(b, b)))
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
/**
*
* @author Xiao Ma
* mail : 409791952@qq.com
*
*/
public class SimilarityUtil {
public static double consineTextSimilarity(String[] left, String[] right) {
Map<String, Integer> leftWordCountMap = new HashMap<String, Integer>();
Map<String, Integer> rightWordCountMap = new HashMap<String, Integer>();
Set<String> uniqueSet = new HashSet<String>();
Integer temp = null;
for (String leftWord : left) {
temp = leftWordCountMap.get(leftWord);
if (temp == null) {
leftWordCountMap.put(leftWord, 1);
uniqueSet.add(leftWord);
} else {
leftWordCountMap.put(leftWord, temp + 1);
}
}
for (String rightWord : right) {
temp = rightWordCountMap.get(rightWord);
if (temp == null) {
rightWordCountMap.put(rightWord, 1);
uniqueSet.add(rightWord);
} else {
rightWordCountMap.put(rightWord, temp + 1);
}
}
int[] leftVector = new int[uniqueSet.size()];
int[] rightVector = new int[uniqueSet.size()];
int index = 0;
Integer tempCount = 0;
for (String uniqueWord : uniqueSet) {
tempCount = leftWordCountMap.get(uniqueWord);
leftVector[index] = tempCount == null ? 0 : tempCount;
tempCount = rightWordCountMap.get(uniqueWord);
rightVector[index] = tempCount == null ? 0 : tempCount;
index++;
}
return consineVectorSimilarity(leftVector, rightVector);
}
/**
* The resulting similarity ranges from −1 meaning exactly opposite, to 1
* meaning exactly the same, with 0 usually indicating independence, and
* in-between values indicating intermediate similarity or dissimilarity.
*
* For text matching, the attribute vectors A and B are usually the term
* frequency vectors of the documents. The cosine similarity can be seen as
* a method of normalizing document length during comparison.
*
* In the case of information retrieval, the cosine similarity of two
* documents will range from 0 to 1, since the term frequencies (tf-idf
* weights) cannot be negative. The angle between two term frequency vectors
* cannot be greater than 90°.
*
* @param leftVector
* @param rightVector
* @return
*/
private static double consineVectorSimilarity(int[] leftVector,
int[] rightVector) {
if (leftVector.length != rightVector.length)
return 1;
double dotProduct = 0;
double leftNorm = 0;
double rightNorm = 0;
for (int i = 0; i < leftVector.length; i++) {
dotProduct += leftVector[i] * rightVector[i];
leftNorm += leftVector[i] * leftVector[i];
rightNorm += rightVector[i] * rightVector[i];
}
double result = dotProduct
/ (Math.sqrt(leftNorm) * Math.sqrt(rightNorm));
return result;
}
public static void main(String[] args) {
String left[] = { "Julie", "loves", "me", "more", "than", "Linda",
"loves", "me" };
String right[] = { "Jane", "likes", "me", "more", "than", "Julie",
"loves", "me" };
System.out.println(consineTextSimilarity(left,right));
}
}