我正在寻找一个函数,它将两个列表作为输入,并返回Pearson相关性,以及相关性的重要性。
当前回答
这是使用numpy的Pearson Correlation函数的实现:
def corr(data1, data2):
"data1 & data2 should be numpy arrays."
mean1 = data1.mean()
mean2 = data2.mean()
std1 = data1.std()
std2 = data2.std()
# corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
return corr
其他回答
下面的代码是对该定义的直接解释:
import math
def average(x):
assert len(x) > 0
return float(sum(x)) / len(x)
def pearson_def(x, y):
assert len(x) == len(y)
n = len(x)
assert n > 0
avg_x = average(x)
avg_y = average(y)
diffprod = 0
xdiff2 = 0
ydiff2 = 0
for idx in range(n):
xdiff = x[idx] - avg_x
ydiff = y[idx] - avg_y
diffprod += xdiff * ydiff
xdiff2 += xdiff * xdiff
ydiff2 += ydiff * ydiff
return diffprod / math.sqrt(xdiff2 * ydiff2)
测试:
print pearson_def([1,2,3], [1,5,7])
返回
0.981980506062
这与Excel,这个计算器,SciPy(也是NumPy)一致,分别返回0.981980506和0.9819805060619657,和0.98198050606196574。
R:
> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805
编辑:修正了一个由评论者指出的错误。
计算相关:
相关性-衡量两个不同变量的相似性
使用皮尔逊相关
from scipy.stats import pearsonr
# final_data is the dataframe with n set of columns
pearson_correlation = final_data.corr(method='pearson')
pearson_correlation
# print correlation of n*n column
使用斯皮尔曼相关
from scipy.stats import spearmanr
# final_data is the dataframe with n set of columns
spearman_correlation = final_data.corr(method='spearman')
spearman_correlation
# print correlation of n*n column
使用Kendall相关
kendall_correlation=final_data.corr(method='kendall')
kendall_correlation
你可以看看scipy.stats:
from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)
>>>
Help on function pearsonr in module scipy.stats.stats:
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
从Python 3.10开始,Pearson的相关系数(statistics.correlation)可以直接在标准库中获得:
from statistics import correlation
# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852
如果你不喜欢安装scipy,我使用了这个快速的hack,稍微修改了Programming Collective Intelligence:
def pearsonr(x, y):
# Assume len(x) == len(y)
n = len(x)
sum_x = float(sum(x))
sum_y = float(sum(y))
sum_x_sq = sum(xi*xi for xi in x)
sum_y_sq = sum(yi*yi for yi in y)
psum = sum(xi*yi for xi, yi in zip(x, y))
num = psum - (sum_x * sum_y/n)
den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
if den == 0: return 0
return num / den
推荐文章
- 为什么我得到“'str'对象没有属性'读取'”当尝试使用' json。载入字符串?
- 不区分大小写的列表排序,没有降低结果?
- 排序后的语法(key=lambda:…)
- 在烧瓶中返回HTTP状态代码201
- 使用python创建一个简单的XML文件
- APT命令行界面式的yes/no输入?
- 如何打印出状态栏和百分比?
- 在Python中获取大文件的MD5哈希值
- 在Python格式字符串中%s是什么意思?
- 如何循环通过所有但最后一项的列表?
- python用什么方法避免默认参数为空列表?
- ValueError: numpy。Ndarray大小改变,可能表示二进制不兼容。期望从C头得到88,从PyObject得到80
- Anaconda /conda -安装特定的软件包版本
- 我在哪里调用Keras的BatchNormalization函数?
- 打印测试执行时间并使用py.test锁定缓慢的测试