我正在寻找一个函数,它将两个列表作为输入,并返回Pearson相关性,以及相关性的重要性。


当前回答

与其依赖numpy/scipy,我认为我的答案应该是最容易编码和理解计算Pearson相关系数(PCC)的步骤。

import math

# calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
         sum += i
    return sum / len(x) 

# calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
         sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))

# calculates the PCC using both the 2 functions above
def pearson(x,y):
    scorex = []
    scorey = []

    for i in x: 
        scorex.append((i - mean(x))/sampleStandardDeviation(x)) 

    for j in y:
        scorey.append((j - mean(y))/sampleStandardDeviation(y))

# multiplies both lists together into 1 list (hence zip) and sums the whole list   
    return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)

PCC的意义基本上是向你展示两个变量/列表的相关性有多强。 需要注意的是,PCC值的范围是-1到1。 0到1之间的值表示正相关。 0值=最高变异(没有任何相关性)。 -1到0之间的值表示负相关。

其他回答

def correlation_score(y_true, y_pred):
    """Scores the predictions according to the competition rules. 
    
    It is assumed that the predictions are not constant.
    
    Returns the average of each sample's Pearson correlation coefficient"""
    
    y2 = y_pred.copy()
    y2 -= y2.mean(axis=0);    y2 /= y2.std(axis=0) 
    y1 = y_true.copy(); 
    y1 -= y1.mean(axis=0);    y1 /= y1.std(axis=0) 
        
    c = (y1*y2).mean().mean()# Correlation for rescaled matrices is just matrix product and average 
        
    return c

嗯,很多回复的代码都很长,很难读…

我建议在处理数组时使用numpy及其漂亮的特性:

import numpy as np
def pcc(X, Y):
   ''' Compute Pearson Correlation Coefficient. '''
   # Normalise X and Y
   X -= X.mean(0)
   Y -= Y.mean(0)
   # Standardise X and Y
   X /= X.std(0)
   Y /= Y.std(0)
   # Compute mean product
   return np.mean(X*Y)

# Using it on a random example
from random import random
X = np.array([random() for x in xrange(100)])
Y = np.array([random() for x in xrange(100)])
pcc(X, Y)

如果你不喜欢安装scipy,我使用了这个快速的hack,稍微修改了Programming Collective Intelligence:

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(xi*xi for xi in x)
  sum_y_sq = sum(yi*yi for yi in y)
  psum = sum(xi*yi for xi, yi in zip(x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den

您可能想知道如何在寻找特定方向的相关性(负相关或正相关)的上下文中解释您的概率。这是我写的一个函数。它甚至可能是正确的!

这是基于我从http://www.vassarstats.net/rsig.html和http://en.wikipedia.org/wiki/Student%27s_t_distribution上收集到的信息,感谢这里发布的其他答案。

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
#  if positive, p is the probability that there is no positive correlation in
#    the population sampled by X and Y
#  if negative, p is the probability that there is no negative correlation
#  if 0, p is the probability that there is no correlation in either direction
def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)

    if not direction:
        return (corr, prb_2_tail)

    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)

    return (corr, 1 - prb_1_tail)

计算相关:

相关性-衡量两个不同变量的相似性

使用皮尔逊相关

from scipy.stats import pearsonr
# final_data is the dataframe with n set of columns
pearson_correlation = final_data.corr(method='pearson')
pearson_correlation
# print correlation of n*n column

使用斯皮尔曼相关

from scipy.stats import spearmanr
# final_data is the dataframe with n set of columns
spearman_correlation = final_data.corr(method='spearman')
spearman_correlation
# print correlation of n*n column

使用Kendall相关

kendall_correlation=final_data.corr(method='kendall')
kendall_correlation