我正在寻找一个函数,它将两个列表作为输入,并返回Pearson相关性,以及相关性的重要性。
当前回答
从Python 3.10开始,Pearson的相关系数(statistics.correlation)可以直接在标准库中获得:
from statistics import correlation
# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852
其他回答
嗯,很多回复的代码都很长,很难读…
我建议在处理数组时使用numpy及其漂亮的特性:
import numpy as np
def pcc(X, Y):
''' Compute Pearson Correlation Coefficient. '''
# Normalise X and Y
X -= X.mean(0)
Y -= Y.mean(0)
# Standardise X and Y
X /= X.std(0)
Y /= Y.std(0)
# Compute mean product
return np.mean(X*Y)
# Using it on a random example
from random import random
X = np.array([random() for x in xrange(100)])
Y = np.array([random() for x in xrange(100)])
pcc(X, Y)
你可以看看scipy.stats:
from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)
>>>
Help on function pearsonr in module scipy.stats.stats:
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
下面是mkh答案的一个变体,比它运行得快得多,还有scipy.stats。皮尔逊,使用numba。
import numba
@numba.jit
def corr(data1, data2):
M = data1.size
sum1 = 0.
sum2 = 0.
for i in range(M):
sum1 += data1[i]
sum2 += data2[i]
mean1 = sum1 / M
mean2 = sum2 / M
var_sum1 = 0.
var_sum2 = 0.
cross_sum = 0.
for i in range(M):
var_sum1 += (data1[i] - mean1) ** 2
var_sum2 += (data2[i] - mean2) ** 2
cross_sum += (data1[i] * data2[i])
std1 = (var_sum1 / M) ** .5
std2 = (var_sum2 / M) ** .5
cross_mean = cross_sum / M
return (cross_mean - mean1 * mean2) / (std1 * std2)
与其依赖numpy/scipy,我认为我的答案应该是最容易编码和理解计算Pearson相关系数(PCC)的步骤。
import math
# calculates the mean
def mean(x):
sum = 0.0
for i in x:
sum += i
return sum / len(x)
# calculates the sample standard deviation
def sampleStandardDeviation(x):
sumv = 0.0
for i in x:
sumv += (i - mean(x))**2
return math.sqrt(sumv/(len(x)-1))
# calculates the PCC using both the 2 functions above
def pearson(x,y):
scorex = []
scorey = []
for i in x:
scorex.append((i - mean(x))/sampleStandardDeviation(x))
for j in y:
scorey.append((j - mean(y))/sampleStandardDeviation(y))
# multiplies both lists together into 1 list (hence zip) and sums the whole list
return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)
PCC的意义基本上是向你展示两个变量/列表的相关性有多强。 需要注意的是,PCC值的范围是-1到1。 0到1之间的值表示正相关。 0值=最高变异(没有任何相关性)。 -1到0之间的值表示负相关。
从Python 3.10开始,Pearson的相关系数(statistics.correlation)可以直接在标准库中获得:
from statistics import correlation
# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852
推荐文章
- 如何激活蟒蛇环境
- 省略[…]意思是在一个列表里?
- 为什么我得到“'str'对象没有属性'读取'”当尝试使用' json。载入字符串?
- 不区分大小写的列表排序,没有降低结果?
- 排序后的语法(key=lambda:…)
- 在烧瓶中返回HTTP状态代码201
- 使用python创建一个简单的XML文件
- APT命令行界面式的yes/no输入?
- 如何打印出状态栏和百分比?
- 在Python中获取大文件的MD5哈希值
- 在Python格式字符串中%s是什么意思?
- 如何循环通过所有但最后一项的列表?
- python用什么方法避免默认参数为空列表?
- ValueError: numpy。Ndarray大小改变,可能表示二进制不兼容。期望从C头得到88,从PyObject得到80
- Anaconda /conda -安装特定的软件包版本