我正在寻找一个函数,它将两个列表作为输入,并返回Pearson相关性,以及相关性的重要性。
当前回答
一个替代方法可以是一个来自linreturn的本地scipy函数,它计算:
斜率:回归线的斜率 截距:回归线的截距 R-value:相关系数 p值:零假设为斜率为零的假设检验的双面p值 stderr:估计的标准错误
这里有一个例子:
a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)
会回复你:
LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)
其他回答
下面的代码是对该定义的直接解释:
import math
def average(x):
assert len(x) > 0
return float(sum(x)) / len(x)
def pearson_def(x, y):
assert len(x) == len(y)
n = len(x)
assert n > 0
avg_x = average(x)
avg_y = average(y)
diffprod = 0
xdiff2 = 0
ydiff2 = 0
for idx in range(n):
xdiff = x[idx] - avg_x
ydiff = y[idx] - avg_y
diffprod += xdiff * ydiff
xdiff2 += xdiff * xdiff
ydiff2 += ydiff * ydiff
return diffprod / math.sqrt(xdiff2 * ydiff2)
测试:
print pearson_def([1,2,3], [1,5,7])
返回
0.981980506062
这与Excel,这个计算器,SciPy(也是NumPy)一致,分别返回0.981980506和0.9819805060619657,和0.98198050606196574。
R:
> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805
编辑:修正了一个由评论者指出的错误。
本文给出了一种基于稀疏向量的pearson相关的实现方法。这里的向量表示为(index, value)表示的元组列表。两个稀疏向量可以是不同的长度,但总的向量大小必须是相同的。这对于文本挖掘应用程序非常有用,其中向量大小非常大,因为大多数特征都是单词包,因此通常使用稀疏向量执行计算。
def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
indexed_feature_dict = {}
if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")
sum_a = sum(value for index, value in first_feature_vector)
sum_b = sum(value for index, value in second_feature_vector)
avg_a = float(sum_a) / length_of_featureset
avg_b = float(sum_b) / length_of_featureset
mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))
covariance_a_b = 0
#calculate covariance for the sparse vectors
for tuple in first_feature_vector:
if len(tuple) != 2:
raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
indexed_feature_dict[tuple[0]] = tuple[1]
count_of_features = 0
for tuple in second_feature_vector:
count_of_features += 1
if len(tuple) != 2:
raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
if tuple[0] in indexed_feature_dict:
covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
del (indexed_feature_dict[tuple[0]])
else:
covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)
for index in indexed_feature_dict:
count_of_features += 1
covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)
#adjust covariance with rest of vector with 0 value
covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b
if mean_sq_error_a == 0 or mean_sq_error_b == 0:
return -1
else:
return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)
单元测试:
def test_get_get_pearson_corelation(self):
vector_a = [(1, 1), (2, 2), (3, 3)]
vector_b = [(1, 1), (2, 5), (3, 7)]
self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)
vector_a = [(1, 1), (2, 2), (3, 3)]
vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)
你可以看看这篇文章。这是一个使用pandas库(适用于Python)根据多个文件的历史外汇货币对数据计算相关性的示例,然后使用seaborn库生成热图图。
http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/
嗯,很多回复的代码都很长,很难读…
我建议在处理数组时使用numpy及其漂亮的特性:
import numpy as np
def pcc(X, Y):
''' Compute Pearson Correlation Coefficient. '''
# Normalise X and Y
X -= X.mean(0)
Y -= Y.mean(0)
# Standardise X and Y
X /= X.std(0)
Y /= Y.std(0)
# Compute mean product
return np.mean(X*Y)
# Using it on a random example
from random import random
X = np.array([random() for x in xrange(100)])
Y = np.array([random() for x in xrange(100)])
pcc(X, Y)
你可以用pandas.DataFrame这样做。相关系数:
import pandas as pd
a = [[1, 2, 3],
[5, 6, 9],
[5, 6, 11],
[5, 6, 13],
[5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()
这给了
0 1 2
0 1.000000 0.745601 0.916579
1 0.745601 1.000000 0.544248
2 0.916579 0.544248 1.000000
推荐文章
- 证书验证失败:无法获得本地颁发者证书
- 当使用pip3安装包时,“Python中的ssl模块不可用”
- 无法切换Python与pyenv
- Python if not == vs if !=
- 如何从scikit-learn决策树中提取决策规则?
- 为什么在Mac OS X v10.9 (Mavericks)的终端中apt-get功能不起作用?
- 将旋转的xtick标签与各自的xtick对齐
- 为什么元组可以包含可变项?
- 如何合并字典的字典?
- 如何创建类属性?
- 不区分大小写的“in”
- 在Python中获取迭代器中的元素个数
- 解析日期字符串并更改格式
- 使用try和。Python中的if
- 如何在Python中获得所有直接子目录