我正在寻找一个函数,它将两个列表作为输入,并返回Pearson相关性,以及相关性的重要性。
当前回答
def pearson(x,y):
n=len(x)
vals=range(n)
sumx=sum([float(x[i]) for i in vals])
sumy=sum([float(y[i]) for i in vals])
sumxSq=sum([x[i]**2.0 for i in vals])
sumySq=sum([y[i]**2.0 for i in vals])
pSum=sum([x[i]*y[i] for i in vals])
# Calculating Pearson correlation
num=pSum-(sumx*sumy/n)
den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
if den==0: return 0
r=num/den
return r
其他回答
def correlation_score(y_true, y_pred):
"""Scores the predictions according to the competition rules.
It is assumed that the predictions are not constant.
Returns the average of each sample's Pearson correlation coefficient"""
y2 = y_pred.copy()
y2 -= y2.mean(axis=0); y2 /= y2.std(axis=0)
y1 = y_true.copy();
y1 -= y1.mean(axis=0); y1 /= y1.std(axis=0)
c = (y1*y2).mean().mean()# Correlation for rescaled matrices is just matrix product and average
return c
下面是mkh答案的一个变体,比它运行得快得多,还有scipy.stats。皮尔逊,使用numba。
import numba
@numba.jit
def corr(data1, data2):
M = data1.size
sum1 = 0.
sum2 = 0.
for i in range(M):
sum1 += data1[i]
sum2 += data2[i]
mean1 = sum1 / M
mean2 = sum2 / M
var_sum1 = 0.
var_sum2 = 0.
cross_sum = 0.
for i in range(M):
var_sum1 += (data1[i] - mean1) ** 2
var_sum2 += (data2[i] - mean2) ** 2
cross_sum += (data1[i] * data2[i])
std1 = (var_sum1 / M) ** .5
std2 = (var_sum2 / M) ** .5
cross_mean = cross_sum / M
return (cross_mean - mean1 * mean2) / (std1 * std2)
从Python 3.10开始,Pearson的相关系数(statistics.correlation)可以直接在标准库中获得:
from statistics import correlation
# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852
你可以看看scipy.stats:
from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)
>>>
Help on function pearsonr in module scipy.stats.stats:
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
如果你不喜欢安装scipy,我使用了这个快速的hack,稍微修改了Programming Collective Intelligence:
def pearsonr(x, y):
# Assume len(x) == len(y)
n = len(x)
sum_x = float(sum(x))
sum_y = float(sum(y))
sum_x_sq = sum(xi*xi for xi in x)
sum_y_sq = sum(yi*yi for yi in y)
psum = sum(xi*yi for xi, yi in zip(x, y))
num = psum - (sum_x * sum_y/n)
den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
if den == 0: return 0
return num / den
推荐文章
- 证书验证失败:无法获得本地颁发者证书
- 当使用pip3安装包时,“Python中的ssl模块不可用”
- 无法切换Python与pyenv
- Python if not == vs if !=
- 如何从scikit-learn决策树中提取决策规则?
- 为什么在Mac OS X v10.9 (Mavericks)的终端中apt-get功能不起作用?
- 将旋转的xtick标签与各自的xtick对齐
- 为什么元组可以包含可变项?
- 如何合并字典的字典?
- 如何创建类属性?
- 不区分大小写的“in”
- 在Python中获取迭代器中的元素个数
- 解析日期字符串并更改格式
- 使用try和。Python中的if
- 如何在Python中获得所有直接子目录