我正在寻找一个函数,它将两个列表作为输入,并返回Pearson相关性,以及相关性的重要性。


当前回答

你可以看看这篇文章。这是一个使用pandas库(适用于Python)根据多个文件的历史外汇货币对数据计算相关性的示例,然后使用seaborn库生成热图图。

http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/

其他回答

对此,我有一个非常简单易懂的解决方案。对于两个长度相等的数组,Pearson系数可以很容易地计算如下:

def manual_pearson(a,b):
"""
Accepts two arrays of equal length, and computes correlation coefficient. 
Numerator is the sum of product of (a - a_avg) and (b - b_avg), 
while denominator is the product of a_std and b_std multiplied by 
length of array. 
"""
  a_avg, b_avg = np.average(a), np.average(b)
  a_stdev, b_stdev = np.std(a), np.std(b)
  n = len(a)
  denominator = a_stdev * b_stdev * n
  numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
  p_coef = numerator/denominator
  return p_coef

从Python 3.10开始,Pearson的相关系数(statistics.correlation)可以直接在标准库中获得:

from statistics import correlation

# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852

如果你不喜欢安装scipy,我使用了这个快速的hack,稍微修改了Programming Collective Intelligence:

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(xi*xi for xi in x)
  sum_y_sq = sum(yi*yi for yi in y)
  psum = sum(xi*yi for xi, yi in zip(x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den
def pearson(x,y):
  n=len(x)
  vals=range(n)

  sumx=sum([float(x[i]) for i in vals])
  sumy=sum([float(y[i]) for i in vals])

  sumxSq=sum([x[i]**2.0 for i in vals])
  sumySq=sum([y[i]**2.0 for i in vals])

  pSum=sum([x[i]*y[i] for i in vals])
  # Calculating Pearson correlation
  num=pSum-(sumx*sumy/n)
  den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
  if den==0: return 0
  r=num/den
  return r

计算相关:

相关性-衡量两个不同变量的相似性

使用皮尔逊相关

from scipy.stats import pearsonr
# final_data is the dataframe with n set of columns
pearson_correlation = final_data.corr(method='pearson')
pearson_correlation
# print correlation of n*n column

使用斯皮尔曼相关

from scipy.stats import spearmanr
# final_data is the dataframe with n set of columns
spearman_correlation = final_data.corr(method='spearman')
spearman_correlation
# print correlation of n*n column

使用Kendall相关

kendall_correlation=final_data.corr(method='kendall')
kendall_correlation