谁能举一个余弦相似度的例子，用一个简单的图形化的方法?

维基百科上的余弦相似度文章

你能在这里(以列表或其他形式)显示向量吗? 然后算一算，看看是怎么回事?

当前回答

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * 
* @author Xiao Ma
* mail : 409791952@qq.com
*
*/
  public class SimilarityUtil {

public static double consineTextSimilarity(String[] left, String[] right) {
    Map<String, Integer> leftWordCountMap = new HashMap<String, Integer>();
    Map<String, Integer> rightWordCountMap = new HashMap<String, Integer>();
    Set<String> uniqueSet = new HashSet<String>();
    Integer temp = null;
    for (String leftWord : left) {
        temp = leftWordCountMap.get(leftWord);
        if (temp == null) {
            leftWordCountMap.put(leftWord, 1);
            uniqueSet.add(leftWord);
        } else {
            leftWordCountMap.put(leftWord, temp + 1);
        }
    }
    for (String rightWord : right) {
        temp = rightWordCountMap.get(rightWord);
        if (temp == null) {
            rightWordCountMap.put(rightWord, 1);
            uniqueSet.add(rightWord);
        } else {
            rightWordCountMap.put(rightWord, temp + 1);
        }
    }
    int[] leftVector = new int[uniqueSet.size()];
    int[] rightVector = new int[uniqueSet.size()];
    int index = 0;
    Integer tempCount = 0;
    for (String uniqueWord : uniqueSet) {
        tempCount = leftWordCountMap.get(uniqueWord);
        leftVector[index] = tempCount == null ? 0 : tempCount;
        tempCount = rightWordCountMap.get(uniqueWord);
        rightVector[index] = tempCount == null ? 0 : tempCount;
        index++;
    }
    return consineVectorSimilarity(leftVector, rightVector);
}

/**
 * The resulting similarity ranges from −1 meaning exactly opposite, to 1
 * meaning exactly the same, with 0 usually indicating independence, and
 * in-between values indicating intermediate similarity or dissimilarity.
 * 
 * For text matching, the attribute vectors A and B are usually the term
 * frequency vectors of the documents. The cosine similarity can be seen as
 * a method of normalizing document length during comparison.
 * 
 * In the case of information retrieval, the cosine similarity of two
 * documents will range from 0 to 1, since the term frequencies (tf-idf
 * weights) cannot be negative. The angle between two term frequency vectors
 * cannot be greater than 90°.
 * 
 * @param leftVector
 * @param rightVector
 * @return
 */
private static double consineVectorSimilarity(int[] leftVector,
        int[] rightVector) {
    if (leftVector.length != rightVector.length)
        return 1;
    double dotProduct = 0;
    double leftNorm = 0;
    double rightNorm = 0;
    for (int i = 0; i < leftVector.length; i++) {
        dotProduct += leftVector[i] * rightVector[i];
        leftNorm += leftVector[i] * leftVector[i];
        rightNorm += rightVector[i] * rightVector[i];
    }

    double result = dotProduct
            / (Math.sqrt(leftNorm) * Math.sqrt(rightNorm));
    return result;
}

public static void main(String[] args) {
    String left[] = { "Julie", "loves", "me", "more", "than", "Linda",
            "loves", "me" };
    String right[] = { "Jane", "likes", "me", "more", "than", "Julie",
            "loves", "me" };
    System.out.println(consineTextSimilarity(left,right));
}
}

2013-07-29 03:33:48

其他回答

以@Bill Bell为例，在[R]中有两种方法

a = c(2,1,0,2,0,1,1,1)

b = c(2,1,1,1,1,0,1,1)

d = (a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

或者利用crossprod()方法的性能…

e = crossprod(a, b) / (sqrt(crossprod(a, a)) * sqrt(crossprod(b, b)))

2013-10-03 14:54:39

这是我在c#中的实现。

using System;

namespace CosineSimilarity
{
    class Program
    {
        static void Main()
        {
            int[] vecA = {1, 2, 3, 4, 5};
            int[] vecB = {6, 7, 7, 9, 10};

            var cosSimilarity = CalculateCosineSimilarity(vecA, vecB);

            Console.WriteLine(cosSimilarity);
            Console.Read();
        }

        private static double CalculateCosineSimilarity(int[] vecA, int[] vecB)
        {
            var dotProduct = DotProduct(vecA, vecB);
            var magnitudeOfA = Magnitude(vecA);
            var magnitudeOfB = Magnitude(vecB);

            return dotProduct/(magnitudeOfA*magnitudeOfB);
        }

        private static double DotProduct(int[] vecA, int[] vecB)
        {
            // I'm not validating inputs here for simplicity.            
            double dotProduct = 0;
            for (var i = 0; i < vecA.Length; i++)
            {
                dotProduct += (vecA[i] * vecB[i]);
            }

            return dotProduct;
        }

        // Magnitude of the vector is the square root of the dot product of the vector with itself.
        private static double Magnitude(int[] vector)
        {
            return Math.Sqrt(DotProduct(vector, vector));
        }
    }
}

2009-11-17 05:09:17

这段Python代码是我实现算法的快速而肮脏的尝试:

import math
from collections import Counter

def build_vector(iterable1, iterable2):
    counter1 = Counter(iterable1)
    counter2 = Counter(iterable2)
    all_items = set(counter1.keys()).union(set(counter2.keys()))
    vector1 = [counter1[k] for k in all_items]
    vector2 = [counter2[k] for k in all_items]
    return vector1, vector2

def cosim(v1, v2):
    dot_product = sum(n1 * n2 for n1, n2 in zip(v1, v2) )
    magnitude1 = math.sqrt(sum(n ** 2 for n in v1))
    magnitude2 = math.sqrt(sum(n ** 2 for n in v2))
    return dot_product / (magnitude1 * magnitude2)


l1 = "Julie loves me more than Linda loves me".split()
l2 = "Jane likes me more than Julie loves me or".split()


v1, v2 = build_vector(l1, l2)
print(cosim(v1, v2))

2013-02-19 02:17:07

简单的JAVA代码计算余弦相似度

/**
   * Method to calculate cosine similarity of vectors
   * 1 - exactly similar (angle between them is 0)
   * 0 - orthogonal vectors (angle between them is 90)
   * @param vector1 - vector in the form [a1, a2, a3, ..... an]
   * @param vector2 - vector in the form [b1, b2, b3, ..... bn]
   * @return - the cosine similarity of vectors (ranges from 0 to 1)
   */
  private double cosineSimilarity(List<Double> vector1, List<Double> vector2) {

    double dotProduct = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < vector1.size(); i++) {
      dotProduct += vector1.get(i) * vector2.get(i);
      normA += Math.pow(vector1.get(i), 2);
      normB += Math.pow(vector2.get(i), 2);
    }
    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
  }

2016-10-21 13:12:24

我猜你更感兴趣的是深入了解余弦相似度工作的“为什么”(为什么它提供了一个很好的相似性指示)，而不是“如何”计算它(用于计算的具体操作)。如果您对后者感兴趣，请参阅Daniel在这篇文章中指出的参考文献，以及相关的SO问题。

为了解释如何，甚至是为什么，首先，简化问题并只在二维空间中工作是有用的。一旦你在2D中得到了这个，在三维中考虑它就更容易了，当然在更大的维度中很难想象，但到那时我们可以用线性代数来进行数值计算，也可以帮助我们在n维中考虑直线/向量/“平面”/“球”，尽管我们不能画出来。

So, in two dimensions: with regards to text similarity this means that we would focus on two distinct terms, say the words "London" and "Paris", and we'd count how many times each of these words is found in each of the two documents we wish to compare. This gives us, for each document, a point in the the x-y plane. For example, if Doc1 had Paris once, and London four times, a point at (1,4) would present this document (with regards to this diminutive evaluation of documents). Or, speaking in terms of vectors, this Doc1 document would be an arrow going from the origin to point (1,4). With this image in mind, let's think about what it means for two documents to be similar and how this relates to the vectors.

VERY similar documents (again with regards to this limited set of dimensions) would have the very same number of references to Paris, AND the very same number of references to London, or maybe, they could have the same ratio of these references. A Document, Doc2, with 2 refs to Paris and 8 refs to London, would also be very similar, only with maybe a longer text or somehow more repetitive of the cities' names, but in the same proportion. Maybe both documents are guides about London, only making passing references to Paris (and how uncool that city is ;-) Just kidding!!!.

现在，不太相似的文件可能也会提到两个城市，但比例不同。也许文档2只会提到巴黎一次，伦敦七次。

回到我们的x-y平面，如果我们画出这些假设的文档，我们会看到当它们非常相似时，它们的向量会重叠(尽管有些向量可能更长)，当它们的共同点开始减少时，这些向量开始发散，它们之间的角度会更大。

通过测量向量之间的夹角，我们可以很好地了解它们的相似性，为了让事情变得更简单，通过计算这个夹角的余弦，我们有一个很好的0到1或-1到1的值，这表明了这种相似性，这取决于我们解释什么以及如何解释。角度越小，余弦值越大(越接近1)，相似度也越高。

在极端情况下，如果Doc1只引用巴黎，而Doc2只引用伦敦，那么这些文档绝对没有任何共同之处。Doc1的向量在x轴上，Doc2的向量在y轴上，角度是90度，cos0。在这种情况下，我们会说这些文档彼此正交。

Adding dimensions: With this intuitive feel for similarity expressed as a small angle (or large cosine), we can now imagine things in 3 dimensions, say by bringing the word "Amsterdam" into the mix, and visualize quite well how a document with two references to each would have a vector going in a particular direction, and we can see how this direction would compare to a document citing Paris and London three times each, but not Amsterdam, etc. As said, we can try and imagine the this fancy space for 10 or 100 cities. It's hard to draw, but easy to conceptualize.

我将用几句话来结束这个公式本身。正如我所说，其他参考文献提供了关于计算的良好信息。

首先是二维空间。两个向量夹角余弦的公式是由三角函数差(角a和角b之间)推导出来的:

cos(a - b) = (cos(a) * cos(b)) + (sin (a) * sin(b))

这个公式看起来很类似于点积公式:

Vect1 . Vect2 =  (x1 * x2) + (y1 * y2)

其中cos(a)对应x值，sin(a)对应y值，对于第一个向量，等等。唯一的问题是，x, y等并不完全是cos和sin值，因为这些值需要在单位圆上读取。这就是公式分母的作用所在:通过除以这些向量长度的乘积，x和y坐标就变得标准化了。

2009-11-17 06:04:30

谁能举一个余弦相似度的例子，用一个简单的图形化的方法?

推荐文章

最新文章

标签