是否有一种方便的方法来计算一个序列或一维numpy数组的百分位数?

我正在寻找类似Excel的百分位数函数。

我在NumPy的统计参考中找不到这个。我所能找到的是中位数(第50百分位),但没有更具体的东西。


当前回答

要计算一个系列的百分位数,运行:

from scipy.stats import rankdata
import numpy as np

def calc_percentile(a, method='min'):
    if isinstance(a, list):
        a = np.asarray(a)
    return rankdata(a, method=method) / float(len(a))

例如:

a = range(20)
print {val: round(percentile, 3) for val, percentile in zip(a, calc_percentile(a))}
>>> {0: 0.05, 1: 0.1, 2: 0.15, 3: 0.2, 4: 0.25, 5: 0.3, 6: 0.35, 7: 0.4, 8: 0.45, 9: 0.5, 10: 0.55, 11: 0.6, 12: 0.65, 13: 0.7, 14: 0.75, 15: 0.8, 16: 0.85, 17: 0.9, 18: 0.95, 19: 1.0}

其他回答

如果你需要答案是输入numpy数组的成员:

再加上numpy中的百分位数函数默认情况下将输出计算为输入向量中两个相邻项的线性加权平均。在某些情况下,人们可能希望返回的百分位数是向量的实际元素,在这种情况下,从v1.9.0开始,您可以使用“插值”选项,使用“低”、“高”或“最近”。

import numpy as np
x=np.random.uniform(10,size=(1000))-5.0

np.percentile(x,70) # 70th percentile

2.075966046220879

np.percentile(x,70,interpolation="nearest")

2.0729677997904314

后者是向量中的一个实际条目,而前者是与百分位数相邻的两个向量条目的线性插值

import numpy as np
a = [154, 400, 1124, 82, 94, 108]
print np.percentile(a,95) # gives the 95th percentile

我引导数据,然后绘制出10个样本的置信区间。置信区间表示概率在5%到95%之间的范围。

 import pandas as pd
 import matplotlib.pyplot as plt
 import seaborn as sns
 import numpy as np
 import json
 import dc_stat_think as dcst

 data = [154, 400, 1124, 82, 94, 108]
 #print (np.percentile(data,[0.5,95])) # gives the 95th percentile

 bs_data = dcst.draw_bs_reps(data, np.mean, size=6*10)

 #print(np.reshape(bs_data,(24,6)))

 x= np.linspace(1,6,6)
 print(x)
 for (item1,item2,item3,item4,item5,item6) in bs_data.reshape((10,6)):
     line_data=[item1,item2,item3,item4,item5,item6]
     ci=np.percentile(line_data,[.025,.975])
     mean_avg=np.mean(line_data)
     fig, ax = plt.subplots()
     ax.plot(x,line_data)
     ax.fill_between(x, (line_data-ci[0]), (line_data+ci[1]), color='b', alpha=.1)
     ax.axhline(mean_avg,color='red')
     plt.show()

计算一维numpy序列或矩阵的百分位数的一种方便方法是使用numpy。百分位< https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html >。例子:

import numpy as np

a = np.array([0,1,2,3,4,5,6,7,8,9,10])
p50 = np.percentile(a, 50) # return 50th percentile, e.g median.
p90 = np.percentile(a, 90) # return 90th percentile.
print('median = ',p50,' and p90 = ',p90) # median =  5.0  and p90 =  9.0

但是,如果您的数据中有任何NaN值,则上述函数将没有用处。在这种情况下,推荐使用numpy函数。Nanpercentile <https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanpercentile.html>函数:

import numpy as np

a_NaN = np.array([0.,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.])
a_NaN[0] = np.nan
print('a_NaN',a_NaN)
p50 = np.nanpercentile(a_NaN, 50) # return 50th percentile, e.g median.
p90 = np.nanpercentile(a_NaN, 90) # return 90th percentile.
print('median = ',p50,' and p90 = ',p90) # median =  5.5  and p90 =  9.1

在上面给出的两个选项中,您仍然可以选择插值模式。为了更容易理解,请参考下面的例子。

import numpy as np

b = np.array([1,2,3,4,5,6,7,8,9,10])
print('percentiles using default interpolation')
p10 = np.percentile(b, 10) # return 10th percentile.
p50 = np.percentile(b, 50) # return 50th percentile, e.g median.
p90 = np.percentile(b, 90) # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.9 , median =  5.5  and p90 =  9.1

print('percentiles using interpolation = ', "linear")
p10 = np.percentile(b, 10,interpolation='linear') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='linear') # return 50th percentile, e.g median.
p90 = np.percentile(b, 90,interpolation='linear') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.9 , median =  5.5  and p90 =  9.1

print('percentiles using interpolation = ', "lower")
p10 = np.percentile(b, 10,interpolation='lower') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='lower') # return 50th percentile, e.g median.
p90 = np.percentile(b, 90,interpolation='lower') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1 , median =  5  and p90 =  9

print('percentiles using interpolation = ', "higher")
p10 = np.percentile(b, 10,interpolation='higher') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='higher') # return 50th percentile, e.g median.
p90 = np.percentile(b, 90,interpolation='higher') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  2 , median =  6  and p90 =  10

print('percentiles using interpolation = ', "midpoint")
p10 = np.percentile(b, 10,interpolation='midpoint') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='midpoint') # return 50th percentile, e.g median.
p90 = np.percentile(b, 90,interpolation='midpoint') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.5 , median =  5.5  and p90 =  9.5

print('percentiles using interpolation = ', "nearest")
p10 = np.percentile(b, 10,interpolation='nearest') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='nearest') # return 50th percentile, e.g median.
p90 = np.percentile(b, 90,interpolation='nearest') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  2 , median =  5  and p90 =  9

如果您的输入数组只包含整数值,那么您可能对作为整数的百分比答案感兴趣。如果是这样,选择插值模式,如'低','高',或'最近'。

您可能会对SciPy Stats包感兴趣。它有你所追求的百分位数函数和许多其他统计上的好处。

Percentile()在numpy中也可用。

import numpy as np
a = np.array([1,2,3,4,5])
p = np.percentile(a, 50) # return 50th percentile, e.g median.
print p
3.0

这张票让我相信他们不会很快将percentile()集成到numpy中。