我如何有效地获得一个NumPy数组中每个唯一值的频率计数?
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> freq_count(x)
[(1, 5), (2, 3), (5, 1), (25, 1)]
我如何有效地获得一个NumPy数组中每个唯一值的频率计数?
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> freq_count(x)
[(1, 5), (2, 3), (5, 1), (25, 1)]
当前回答
这是迄今为止最通用和性能最好的解决方案;很惊讶它还没有发布。
import numpy as np
def unique_count(a):
unique, inverse = np.unique(a, return_inverse=True)
count = np.zeros(len(unique), np.int)
np.add.at(count, inverse, 1)
return np.vstack(( unique, count)).T
print unique_count(np.random.randint(-10,10,100))
与目前接受的答案不同,它适用于任何可排序的数据类型(不仅仅是正整数),并且具有最佳性能;唯一重要的开销是np.unique所做的排序。
其他回答
为了计算唯一的非整数——类似于Eelco Hoogendoorn的答案,但速度要快得多(在我的机器上是5倍),我使用了weave。内联组合numpy。只有一点c代码;
import numpy as np
from scipy import weave
def count_unique(datain):
"""
Similar to numpy.unique function for returning unique members of
data, but also returns their counts
"""
data = np.sort(datain)
uniq = np.unique(data)
nums = np.zeros(uniq.shape, dtype='int')
code="""
int i,count,j;
j=0;
count=0;
for(i=1; i<Ndata[0]; i++){
count++;
if(data(i) > data(i-1)){
nums(j) = count;
count = 0;
j++;
}
}
// Handle last value
nums(j) = count+1;
"""
weave.inline(code,
['data', 'nums'],
extra_compile_args=['-O2'],
type_converters=weave.converters.blitz)
return uniq, nums
配置文件信息
> %timeit count_unique(data)
> 10000 loops, best of 3: 55.1 µs per loop
Eelco的纯numpy版本:
> %timeit unique_count(data)
> 1000 loops, best of 3: 284 µs per loop
Note
这里存在冗余(unique也执行排序),这意味着可以通过将唯一功能放入c-code循环中来进一步优化代码。
尽管这个问题已经得到了回答,但我建议使用一种不同的方法,即numpy.histogram。这样的函数给定一个序列,它返回其元素分组在箱子中的频率。
但是要注意:它在这个例子中是有效的,因为数字是整数。如果它们是实数,那么这个解就不适用了。
>>> from numpy import histogram
>>> y = histogram (x, bins=x.max()-1)
>>> y
(array([5, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1]),
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.,
12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22.,
23., 24., 25.]))
像这样的东西应该做到:
#create 100 random numbers
arr = numpy.random.random_integers(0,50,100)
#create a dictionary of the unique values
d = dict([(i,0) for i in numpy.unique(arr)])
for number in arr:
d[j]+=1 #increment when that value is found
另外,之前的这篇关于有效计算独特元素的文章似乎与您的问题非常相似,除非我遗漏了什么。
Most of simple problems get complicated because simple functionality like order() in R that gives a statistical result in both and descending order is missing in various python libraries. But if we devise our thinking that all such statistical ordering and parameters in python are easily found in pandas, we can can result sooner than looking in 100 different places. Also, development of R and pandas go hand-in-hand because they were created for same purpose. To solve this problem I use following code that gets me by anywhere:
unique, counts = np.unique(x, return_counts=True)
d = {'unique':unique, 'counts':count} # pass the list to a dictionary
df = pd.DataFrame(d) #dictionary object can be easily passed to make a dataframe
df.sort_values(by = 'count', ascending=False, inplace = True)
df = df.reset_index(drop=True) #optional only if you want to use it further
我对此也很感兴趣,所以我做了一点性能比较(使用perfplot,我的一个爱好项目)。结果:
y = np.bincount(a)
ii = np.nonzero(y)[0]
out = np.vstack((ii, y[ii])).T
是目前为止最快的。(请注意对数缩放。)
代码生成的情节:
import numpy as np
import pandas as pd
import perfplot
from scipy.stats import itemfreq
def bincount(a):
y = np.bincount(a)
ii = np.nonzero(y)[0]
return np.vstack((ii, y[ii])).T
def unique(a):
unique, counts = np.unique(a, return_counts=True)
return np.asarray((unique, counts)).T
def unique_count(a):
unique, inverse = np.unique(a, return_inverse=True)
count = np.zeros(len(unique), dtype=int)
np.add.at(count, inverse, 1)
return np.vstack((unique, count)).T
def pandas_value_counts(a):
out = pd.value_counts(pd.Series(a))
out.sort_index(inplace=True)
out = np.stack([out.keys().values, out.values]).T
return out
b = perfplot.bench(
setup=lambda n: np.random.randint(0, 1000, n),
kernels=[bincount, unique, itemfreq, unique_count, pandas_value_counts],
n_range=[2 ** k for k in range(26)],
xlabel="len(a)",
)
b.save("out.png")
b.show()