我正在寻找一个函数,它将两个列表作为输入,并返回Pearson相关性,以及相关性的重要性。
当前回答
你可以用pandas.DataFrame这样做。相关系数:
import pandas as pd
a = [[1, 2, 3],
[5, 6, 9],
[5, 6, 11],
[5, 6, 13],
[5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()
这给了
0 1 2
0 1.000000 0.745601 0.916579
1 0.745601 1.000000 0.544248
2 0.916579 0.544248 1.000000
其他回答
对此,我有一个非常简单易懂的解决方案。对于两个长度相等的数组,Pearson系数可以很容易地计算如下:
def manual_pearson(a,b):
"""
Accepts two arrays of equal length, and computes correlation coefficient.
Numerator is the sum of product of (a - a_avg) and (b - b_avg),
while denominator is the product of a_std and b_std multiplied by
length of array.
"""
a_avg, b_avg = np.average(a), np.average(b)
a_stdev, b_stdev = np.std(a), np.std(b)
n = len(a)
denominator = a_stdev * b_stdev * n
numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
p_coef = numerator/denominator
return p_coef
如果你不喜欢安装scipy,我使用了这个快速的hack,稍微修改了Programming Collective Intelligence:
def pearsonr(x, y):
# Assume len(x) == len(y)
n = len(x)
sum_x = float(sum(x))
sum_y = float(sum(y))
sum_x_sq = sum(xi*xi for xi in x)
sum_y_sq = sum(yi*yi for yi in y)
psum = sum(xi*yi for xi, yi in zip(x, y))
num = psum - (sum_x * sum_y/n)
den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
if den == 0: return 0
return num / den
你可以看看这篇文章。这是一个使用pandas库(适用于Python)根据多个文件的历史外汇货币对数据计算相关性的示例,然后使用seaborn库生成热图图。
http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/
你可以看看scipy.stats:
from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)
>>>
Help on function pearsonr in module scipy.stats.stats:
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
与其依赖numpy/scipy,我认为我的答案应该是最容易编码和理解计算Pearson相关系数(PCC)的步骤。
import math
# calculates the mean
def mean(x):
sum = 0.0
for i in x:
sum += i
return sum / len(x)
# calculates the sample standard deviation
def sampleStandardDeviation(x):
sumv = 0.0
for i in x:
sumv += (i - mean(x))**2
return math.sqrt(sumv/(len(x)-1))
# calculates the PCC using both the 2 functions above
def pearson(x,y):
scorex = []
scorey = []
for i in x:
scorex.append((i - mean(x))/sampleStandardDeviation(x))
for j in y:
scorey.append((j - mean(y))/sampleStandardDeviation(y))
# multiplies both lists together into 1 list (hence zip) and sums the whole list
return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)
PCC的意义基本上是向你展示两个变量/列表的相关性有多强。 需要注意的是,PCC值的范围是-1到1。 0到1之间的值表示正相关。 0值=最高变异(没有任何相关性)。 -1到0之间的值表示负相关。
推荐文章
- 为什么我得到“'str'对象没有属性'读取'”当尝试使用' json。载入字符串?
- 不区分大小写的列表排序,没有降低结果?
- 排序后的语法(key=lambda:…)
- 在烧瓶中返回HTTP状态代码201
- 使用python创建一个简单的XML文件
- APT命令行界面式的yes/no输入?
- 如何打印出状态栏和百分比?
- 在Python中获取大文件的MD5哈希值
- 在Python格式字符串中%s是什么意思?
- 如何循环通过所有但最后一项的列表?
- python用什么方法避免默认参数为空列表?
- ValueError: numpy。Ndarray大小改变,可能表示二进制不兼容。期望从C头得到88,从PyObject得到80
- Anaconda /conda -安装特定的软件包版本
- 我在哪里调用Keras的BatchNormalization函数?
- 打印测试执行时间并使用py.test锁定缓慢的测试