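I am looking for a function that takes two lists as input and returns the Pearson correlation together with the significance of that correlation.
Current answer
The Pearson correlation can be computed with numpy's corrcoef. It returns the full correlation matrix, so the coefficient between the two lists is the off-diagonal entry.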
import numpy
numpy.corrcoef(list1, list2)[0, 1]
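The question also asks for the significance, which corrcoef does not report. If SciPy is installed, scipy.stats.pearsonr returns both the coefficient and a p-value. As a rough standard-library sketch, the block below estimates significance with an exact permutation test instead. This is a different test from the one SciPy uses, so the p-values will not match, and the helper names pearson_r and permutation_p_value are made up for illustration. It enumerates all n! orderings, so it is only practical for very short lists.

```python
import itertools
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (definition form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(x, y):
    """Two-sided exact permutation p-value (hypothetical helper).

    Counts the orderings of y whose |r| is at least as large as the observed |r|.
    """
    observed = abs(pearson_r(x, y))
    extreme = 0
    total = 0
    for perm in itertools.permutations(y):
        total += 1
        # Small tolerance so orderings that tie with the observed value are counted.
        if abs(pearson_r(x, perm)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


r = pearson_r([1, 2, 3], [1, 5, 7])
# Only 2 of the 6 orderings are at least as extreme, so p = 1/3 here.
p = permutation_p_value([1, 2, 3], [1, 5, 7])
print(r, p)
```

With only three points no ordering can be very surprising, which is why the p-value stays large even though r is close to 1.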
Other answers
def correlation_score(y_true, y_pred):
    """Scores the predictions according to the competition rules.

    It is assumed that the predictions are not constant.

    Returns the average of each sample's Pearson correlation coefficient"""
    # Standardize the predictions along axis 0 (zero mean, unit std).
    y2 = y_pred.copy()
    y2 -= y2.mean(axis=0)
    y2 /= y2.std(axis=0)
    # Standardize the targets the same way.
    y1 = y_true.copy()
    y1 -= y1.mean(axis=0)
    y1 /= y1.std(axis=0)
    # For standardized data, the correlation is the mean of the elementwise product.
    c = (y1 * y2).mean().mean()
    return c
def pearson(x, y):
    n = len(x)
    vals = range(n)
    sumx = sum([float(x[i]) for i in vals])
    sumy = sum([float(y[i]) for i in vals])
    sumxSq = sum([x[i]**2.0 for i in vals])
    sumySq = sum([y[i]**2.0 for i in vals])
    pSum = sum([x[i]*y[i] for i in vals])
    # Calculating Pearson correlation
    num = pSum - (sumx*sumy/n)
    den = ((sumxSq - pow(sumx, 2)/n) * (sumySq - pow(sumy, 2)/n))**.5
    if den == 0:
        return 0
    r = num/den
    return r
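For reference, the sum-based function above appears to rely on the standard identity between the definition of the coefficient and its single-pass form. In the notation below, which is mine rather than the answer's, pSum corresponds to \sum x_i y_i, sumx to \sum x_i, and sumxSq to \sum x_i^2:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
  = \frac{\sum_{i} x_i y_i - \frac{1}{n} \sum_{i} x_i \sum_{i} y_i}
         {\sqrt{\left( \sum_{i} x_i^2 - \frac{1}{n} \bigl( \sum_{i} x_i \bigr)^2 \right)
                \left( \sum_{i} y_i^2 - \frac{1}{n} \bigl( \sum_{i} y_i \bigr)^2 \right)}}
```

The right-hand form needs only one pass over the data, but it subtracts large, nearly equal sums, so it can lose precision when the values are large relative to their spread.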
The code below is a direct interpretation of the definition:
import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)
Test:
print(round(pearson_def([1, 2, 3], [1, 5, 7]), 12))
prints
0.981980506062
This agrees with Excel, this calculator, and SciPy (and also NumPy), which return 0.981980506, 0.9819805060619657, and 0.98198050606196574, respectively.
R:
> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805
Edit: fixed a bug pointed out by a commenter.
If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:
def pearsonr(x, y):
    # Assume len(x) == len(y)
    n = len(x)
    sum_x = float(sum(x))
    sum_y = float(sum(y))
    sum_x_sq = sum(xi*xi for xi in x)
    sum_y_sq = sum(yi*yi for yi in y)
    psum = sum(xi*yi for xi, yi in zip(x, y))
    num = psum - (sum_x * sum_y/n)
    den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
    if den == 0:
        return 0
    return num / den
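Computing correlation:
Correlation measures how strongly two variables are related.
Using Pearson correlation
# Not needed by DataFrame.corr; pearsonr(x, y) gives the coefficient and p-value for one pair
from scipy.stats import pearsonr
# final_data is a pandas DataFrame with n numeric columns
pearson_correlation = final_data.corr(method='pearson')
# n x n matrix of pairwise Pearson correlations between the columns
pearson_correlation
Using Spearman correlation
# Not needed by DataFrame.corr; spearmanr(x, y) gives the coefficient and p-value for one pair
from scipy.stats import spearmanr
# final_data is a pandas DataFrame with n numeric columns
spearman_correlation = final_data.corr(method='spearman')
# n x n matrix of pairwise Spearman correlations between the columns
spearman_correlation
Using Kendall correlation
kendall_correlation = final_data.corr(method='kendall')
# n x n matrix of pairwise Kendall correlations between the columns
kendall_correlation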
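To make the difference between the Pearson and Spearman options above concrete: Spearman's coefficient is Pearson's coefficient computed on the ranks of the data, with tied values sharing their average rank. The block below is a minimal standard-library sketch of that idea, not what pandas runs internally, and the function names average_ranks, pearson and spearman are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation from the definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# y grows monotonically but not linearly with x.
xs = [1, 2, 3, 4]
ys = [1, 4, 9, 16]
print(pearson(xs, ys), spearman(xs, ys))
```

For this monotonic but nonlinear data, the rank-based coefficient is 1 while Pearson's is slightly below 1, which is the usual reason to choose method='spearman' when the relationship is not expected to be linear.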