如何找到每个系数的p值(显著性)?
lm = sklearn.linear_model.LinearRegression()
lm.fit(x,y)
如何找到每个系数的p值(显著性)?
lm = sklearn.linear_model.LinearRegression()
lm.fit(x,y)
当前回答
你可以用pingouin来写一行字。线性回归函数(免责声明:我是Pingouin的创建者),它使用NumPy数组或Pandas DataFrame与单/多变量回归一起工作,例如:
import pingouin as pg
# Using a Pandas DataFrame `df`:
lm = pg.linear_regression(df[['x', 'z']], df['y'])
# Using a NumPy array:
lm = pg.linear_regression(X, y)
输出是一个数据框架,其中包含每个预测器的beta系数、标准误差、t值、p值和置信区间,以及拟合的R^2和调整后的R^2。
其他回答
scikit-learn的线性回归不计算这个信息,但你可以很容易地扩展类来做:
from sklearn import linear_model
from scipy import stats
import numpy as np
class LinearRegression(linear_model.LinearRegression):
"""
LinearRegression class after sklearn's, but calculate t-statistics
and p-values for model coefficients (betas).
Additional attributes available after .fit()
are `t` and `p` which are of the shape (y.shape[1], X.shape[1])
which is (n_features, n_coefs)
This class sets the intercept to 0 by default, since usually we include it
in X.
"""
def __init__(self, *args, **kwargs):
if not "fit_intercept" in kwargs:
kwargs['fit_intercept'] = False
super(LinearRegression, self)\
.__init__(*args, **kwargs)
def fit(self, X, y, n_jobs=1):
self = super(LinearRegression, self).fit(X, y, n_jobs)
sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
se = np.array([
np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X))))
for i in range(sse.shape[0])
])
self.t = self.coef_ / se
self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1]))
return self
从这里偷来的。
您应该看一看Python中用于这种统计分析的统计模型。
elyase的答案https://stackoverflow.com/a/27928411/4240413中的代码实际上不起作用。注意,sse是一个标量,然后它尝试遍历它。下面的代码是修改后的版本。不是特别干净,但我觉得差不多能用。
class LinearRegression(linear_model.LinearRegression):
def __init__(self,*args,**kwargs):
# *args is the list of arguments that might go into the LinearRegression object
# that we don't know about and don't want to have to deal with. Similarly, **kwargs
# is a dictionary of key words and values that might also need to go into the orginal
# LinearRegression object. We put *args and **kwargs so that we don't have to look
# these up and write them down explicitly here. Nice and easy.
if not "fit_intercept" in kwargs:
kwargs['fit_intercept'] = False
super(LinearRegression,self).__init__(*args,**kwargs)
# Adding in t-statistics for the coefficients.
def fit(self,x,y):
# This takes in numpy arrays (not matrices). Also assumes you are leaving out the column
# of constants.
# Not totally sure what 'super' does here and why you redefine self...
self = super(LinearRegression, self).fit(x,y)
n, k = x.shape
yHat = np.matrix(self.predict(x)).T
# Change X and Y into numpy matricies. x also has a column of ones added to it.
x = np.hstack((np.ones((n,1)),np.matrix(x)))
y = np.matrix(y).T
# Degrees of freedom.
df = float(n-k-1)
# Sample variance.
sse = np.sum(np.square(yHat - y),axis=0)
self.sampleVariance = sse/df
# Sample variance for x.
self.sampleVarianceX = x.T*x
# Covariance Matrix = [(s^2)(X'X)^-1]^0.5. (sqrtm = matrix square root. ugly)
self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0,0]*self.sampleVarianceX.I)
# Standard erros for the difference coefficients: the diagonal elements of the covariance matrix.
self.se = self.covarianceMatrix.diagonal()[1:]
# T statistic for each beta.
self.betasTStat = np.zeros(len(self.se))
for i in xrange(len(self.se)):
self.betasTStat[i] = self.coef_[0,i]/self.se[i]
# P-value for each beta. This is a two sided t-test, since the betas can be
# positive or negative.
self.betasPValue = 1 - t.cdf(abs(self.betasTStat),df)
在多变量回归的情况下,@JARH的答案可能有错误。 (我没有足够的声誉来评论。)
在下面一行:
p_values = [2 * (1-stats.t.cdf (np.abs(我),(len (newX) 1)))我在ts_b),
t值遵循degree len(newX)-1的卡方分布,而不是遵循degree len(newX)-len(newX.columns)-1的卡方分布。
所以这应该是:
p_values = [2 * (1-stats.t.cdf (np.abs(我),(len (newX) len (newX.columns) 1)))我在ts_b)
(详见OLS回归的t值)
另外一个已经提出的选择是使用排列测试。将y的值洗牌后对模型进行N次拟合,计算拟合后模型的系数相对于原模型的系数值较大(单侧检验)或绝对值较大(双面检验)的比例。这些比例就是p值。
P_value是f个统计值之一。如果你想要得到这个值,只需使用这几行代码:
import statsmodels.api as sm
from scipy import stats
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
print(est.fit().f_pvalue)