我正在寻找一个函数,它将两个列表作为输入,并返回Pearson相关性,以及相关性的重要性。


当前回答

从Python 3.10开始,Pearson的相关系数(statistics.correlation)可以直接在标准库中获得:

from statistics import correlation

# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852

其他回答

Pearson相关性可以用numpy的corrcoef来计算。

import numpy
numpy.corrcoef(list1, list2)[0, 1]

Pearson coefficient calculation using pandas in python: I would suggest trying this approach since your data contains lists. It will be easy to interact with your data and manipulate it from the console since you can visualise your data structure and update it as you wish. You can also export the data set and save it and add new data out of the python console for later analysis. This code is simpler and contains less lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis

例子:

data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}

import pandas as pd #To Convert your lists to pandas data frames convert your lists into pandas dataframes

df = pd.DataFrame(data, columns = ['list 1','list 2'])

from scipy import stats # For in-built method to get PCC

pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results 

但是,在分析之前,你没有发布你的数据给我看数据集的大小或可能需要的转换。

计算相关:

相关性-衡量两个不同变量的相似性

使用皮尔逊相关

from scipy.stats import pearsonr
# final_data is the dataframe with n set of columns
pearson_correlation = final_data.corr(method='pearson')
pearson_correlation
# print correlation of n*n column

使用斯皮尔曼相关

from scipy.stats import spearmanr
# final_data is the dataframe with n set of columns
spearman_correlation = final_data.corr(method='spearman')
spearman_correlation
# print correlation of n*n column

使用Kendall相关

kendall_correlation=final_data.corr(method='kendall')
kendall_correlation

对此,我有一个非常简单易懂的解决方案。对于两个长度相等的数组,Pearson系数可以很容易地计算如下:

def manual_pearson(a,b):
"""
Accepts two arrays of equal length, and computes correlation coefficient. 
Numerator is the sum of product of (a - a_avg) and (b - b_avg), 
while denominator is the product of a_std and b_std multiplied by 
length of array. 
"""
  a_avg, b_avg = np.average(a), np.average(b)
  a_stdev, b_stdev = np.std(a), np.std(b)
  n = len(a)
  denominator = a_stdev * b_stdev * n
  numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
  p_coef = numerator/denominator
  return p_coef

本文给出了一种基于稀疏向量的pearson相关的实现方法。这里的向量表示为(index, value)表示的元组列表。两个稀疏向量可以是不同的长度,但总的向量大小必须是相同的。这对于文本挖掘应用程序非常有用,其中向量大小非常大,因为大多数特征都是单词包,因此通常使用稀疏向量执行计算。

def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
    indexed_feature_dict = {}
    if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
        raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")

    sum_a = sum(value for index, value in first_feature_vector)
    sum_b = sum(value for index, value in second_feature_vector)

    avg_a = float(sum_a) / length_of_featureset
    avg_b = float(sum_b) / length_of_featureset

    mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
        length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
    mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
        length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))

    covariance_a_b = 0

    #calculate covariance for the sparse vectors
    for tuple in first_feature_vector:
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
        indexed_feature_dict[tuple[0]] = tuple[1]
    count_of_features = 0
    for tuple in second_feature_vector:
        count_of_features += 1
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
        if tuple[0] in indexed_feature_dict:
            covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
            del (indexed_feature_dict[tuple[0]])
        else:
            covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)

    for index in indexed_feature_dict:
        count_of_features += 1
        covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)

    #adjust covariance with rest of vector with 0 value
    covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b

    if mean_sq_error_a == 0 or mean_sq_error_b == 0:
        return -1
    else:
        return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)

单元测试:

def test_get_get_pearson_corelation(self):
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)

    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)