我正在寻找一个函数,它将两个列表作为输入,并返回Pearson相关性,以及相关性的重要性。
当前回答
Pearson coefficient calculation using pandas in python: I would suggest trying this approach since your data contains lists. It will be easy to interact with your data and manipulate it from the console since you can visualise your data structure and update it as you wish. You can also export the data set and save it and add new data out of the python console for later analysis. This code is simpler and contains less lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis
例子:
data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}
import pandas as pd #To Convert your lists to pandas data frames convert your lists into pandas dataframes
df = pd.DataFrame(data, columns = ['list 1','list 2'])
from scipy import stats # For in-built method to get PCC
pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results
但是,在分析之前,你没有发布你的数据给我看数据集的大小或可能需要的转换。
其他回答
你可以看看这篇文章。这是一个使用pandas库(适用于Python)根据多个文件的历史外汇货币对数据计算相关性的示例,然后使用seaborn库生成热图图。
http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/
下面是mkh答案的一个变体,比它运行得快得多,还有scipy.stats。皮尔逊,使用numba。
import numba
@numba.jit
def corr(data1, data2):
M = data1.size
sum1 = 0.
sum2 = 0.
for i in range(M):
sum1 += data1[i]
sum2 += data2[i]
mean1 = sum1 / M
mean2 = sum2 / M
var_sum1 = 0.
var_sum2 = 0.
cross_sum = 0.
for i in range(M):
var_sum1 += (data1[i] - mean1) ** 2
var_sum2 += (data2[i] - mean2) ** 2
cross_sum += (data1[i] * data2[i])
std1 = (var_sum1 / M) ** .5
std2 = (var_sum2 / M) ** .5
cross_mean = cross_sum / M
return (cross_mean - mean1 * mean2) / (std1 * std2)
你可以看看scipy.stats:
from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)
>>>
Help on function pearsonr in module scipy.stats.stats:
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
从Python 3.10开始,Pearson的相关系数(statistics.correlation)可以直接在标准库中获得:
from statistics import correlation
# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852
Pearson相关性可以用numpy的corrcoef来计算。
import numpy
numpy.corrcoef(list1, list2)[0, 1]
推荐文章
- 如何激活蟒蛇环境
- 省略[…]意思是在一个列表里?
- 为什么我得到“'str'对象没有属性'读取'”当尝试使用' json。载入字符串?
- 不区分大小写的列表排序,没有降低结果?
- 排序后的语法(key=lambda:…)
- 在烧瓶中返回HTTP状态代码201
- 使用python创建一个简单的XML文件
- APT命令行界面式的yes/no输入?
- 如何打印出状态栏和百分比?
- 在Python中获取大文件的MD5哈希值
- 在Python格式字符串中%s是什么意思?
- 如何循环通过所有但最后一项的列表?
- python用什么方法避免默认参数为空列表?
- ValueError: numpy。Ndarray大小改变,可能表示二进制不兼容。期望从C头得到88,从PyObject得到80
- Anaconda /conda -安装特定的软件包版本