How do I efficiently obtain the frequency count for each unique value in a NumPy array?
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> freq_count(x)
[(1, 5), (2, 3), (5, 1), (25, 1)]
Current answer
Use numpy.unique with return_counts=True (NumPy 1.9+):
import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
unique, counts = np.unique(x, return_counts=True)
>>> print(np.asarray((unique, counts)).T)
[[ 1  5]
 [ 2  3]
 [ 5  1]
 [25  1]]
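To get exactly the list of tuples shown in the question, the two arrays can be zipped together. This is a minimal sketch; `freq_count` is the hypothetical name from the question, not a NumPy function.

```python
import numpy as np

def freq_count(x):
    """Return [(value, count), ...] sorted by value (hypothetical helper)."""
    values, counts = np.unique(x, return_counts=True)
    # tolist() converts NumPy scalars to plain Python numbers
    return list(zip(values.tolist(), counts.tolist()))

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
print(freq_count(x))  # [(1, 5), (2, 3), (5, 1), (25, 1)]
```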
Compared with scipy.stats.itemfreq:
In [4]: x = np.random.random_integers(0,100,1e6)
In [5]: %timeit unique, counts = np.unique(x, return_counts=True)
10 loops, best of 3: 31.5 ms per loop
In [6]: %timeit scipy.stats.itemfreq(x)
10 loops, best of 3: 170 ms per loop
Other answers
I was interested in this as well, so I did a small performance comparison (using perfplot, a hobby project of mine). Result:
y = np.bincount(a)
ii = np.nonzero(y)[0]
out = np.vstack((ii, y[ii])).T
is by far the fastest. (Note the log scaling.)
Code to generate the plot:
import numpy as np
import pandas as pd
import perfplot
# itemfreq was deprecated in SciPy 1.0 and is missing from newer releases;
# drop it from the kernels list if this import fails.
from scipy.stats import itemfreq


def bincount(a):
    # Only valid for non-negative integers
    y = np.bincount(a)
    ii = np.nonzero(y)[0]
    return np.vstack((ii, y[ii])).T


def unique(a):
    unique, counts = np.unique(a, return_counts=True)
    return np.asarray((unique, counts)).T


def unique_count(a):
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)
    np.add.at(count, inverse, 1)
    return np.vstack((unique, count)).T


def pandas_value_counts(a):
    out = pd.Series(a).value_counts()
    out.sort_index(inplace=True)
    out = np.stack([out.keys().values, out.values]).T
    return out


b = perfplot.bench(
    setup=lambda n: np.random.randint(0, 1000, n),
    kernels=[bincount, unique, itemfreq, unique_count, pandas_value_counts],
    n_range=[2 ** k for k in range(26)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()
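One caveat with the bincount approach benchmarked above: np.bincount only accepts non-negative integers, and it allocates one bin for every integer up to the maximum value. The sketch below shows one possible workaround for negative integers, shifting the data by its minimum before counting. The function name `bincount_freq` is made up for illustration, and this variant was not part of the benchmark.

```python
import numpy as np

def bincount_freq(a):
    """Frequency table for an integer array that may contain negatives.

    Hypothetical helper: shifts values so the smallest becomes 0,
    counts with np.bincount, then shifts the labels back.
    """
    a = np.asarray(a)
    if a.size == 0:
        return np.empty((0, 2), dtype=int)
    offset = a.min()
    y = np.bincount(a - offset)   # memory grows with (max - min + 1)
    ii = np.nonzero(y)[0]         # keep only values that actually occur
    return np.vstack((ii + offset, y[ii])).T

print(bincount_freq(np.array([-2, -2, 0, 3])))
```

Because memory use scales with the value range rather than the array length, this is only a good fit when the integers are densely packed.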
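To count unique non-integers, similar to Eelco Hoogendoorn's answer but considerably faster (a factor of 5 on my machine), I used weave.inline to combine numpy.unique with a bit of C code: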
import numpy as np
from scipy import weave  # only present in old SciPy releases under Python 2

def count_unique(datain):
    """
    Similar to numpy.unique function for returning unique members of
    data, but also returns their counts
    """
    data = np.sort(datain)
    uniq = np.unique(data)
    nums = np.zeros(uniq.shape, dtype='int')

    code="""
    int i,count,j;
    j=0;
    count=0;
    for(i=1; i<Ndata[0]; i++){
        count++;
        if(data(i) > data(i-1)){
            nums(j) = count;
            count = 0;
            j++;
        }
    }
    // Handle last value
    nums(j) = count+1;
    """
    weave.inline(code,
        ['data', 'nums'],
        extra_compile_args=['-O2'],
        type_converters=weave.converters.blitz)
    return uniq, nums
Profile info
> %timeit count_unique(data)
> 10000 loops, best of 3: 55.1 µs per loop
Eelco's pure numpy version:
> %timeit unique_count(data)
> 1000 loops, best of 3: 284 µs per loop
Note
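There is redundancy here (unique also performs a sort), which means the code could probably be optimized further by moving the unique functionality into the C loop.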
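Since scipy.weave is not available in current SciPy releases, the same idea (sort once, then count the length of each run of equal values) can be sketched in plain NumPy. This is an illustrative rewrite, not the author's code; the name `count_unique_sorted` is made up, and no timing is claimed for it.

```python
import numpy as np

def count_unique_sorted(datain):
    """Return (unique values, counts) using a single sort.

    Hypothetical pure-NumPy variant of the run-counting loop above.
    """
    data = np.sort(np.asarray(datain).ravel())
    if data.size == 0:
        return data, np.zeros(0, dtype=int)
    # True at each position where a new run of equal values begins
    starts = np.concatenate(([True], data[1:] != data[:-1]))
    idx = np.flatnonzero(starts)
    # Run length = distance from one run start to the next (or to the end)
    counts = np.diff(np.append(idx, data.size))
    return data[idx], counts

uniq, nums = count_unique_sorted([0.5, 1.5, 0.5, 2.0, 1.5, 0.5])
print(uniq, nums)
```

Doing the sort only once addresses the redundancy mentioned in the note, since the unique values fall out of the same sorted array.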
You can write freq_count like this:
def freq_count(data):
    mp = dict()
    for i in data:
        if i in mp:
            mp[i] = mp[i] + 1
        else:
            mp[i] = 1
    return mp
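from collections import Counter
import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
counter = Counter(x.tolist())  # maps each value to its count
mode = counter.most_common(1)[0][0]  # most frequent value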
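The snippet above only extracts the most frequent value, but the same Counter also holds the full frequency table. As a small sketch, sorting its items yields the list of tuples asked for in the question (the variable name `freq` is just for illustration):

```python
from collections import Counter

import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
counter = Counter(x.tolist())        # value -> count, with plain Python ints
freq = sorted(counter.items())       # sort by value
mode = counter.most_common(1)[0][0]  # most frequent value

print(freq)  # [(1, 5), (2, 3), (5, 1), (25, 1)]
print(mode)  # 1
```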
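This is by far the most general and performant solution; I'm surprised it hasn't been posted yet.
import numpy as np

def unique_count(a):
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)  # np.int was removed in NumPy 1.24
    np.add.at(count, inverse, 1)
    return np.vstack((unique, count)).T

print(unique_count(np.random.randint(-10, 10, 100)))
Unlike the currently accepted answer, it works on any sortable datatype (not just positive integers) and has optimal performance; the only significant cost is the sorting done by np.unique.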
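To illustrate the "any sortable datatype" point, here is a hedged variant that returns values and counts as two separate arrays, because stacking string values with integer counts into one array would convert the counts to strings. It also swaps np.add.at for np.bincount on the inverse indices, which is a different technique from the answer's code. The name `unique_count_pairs` is hypothetical.

```python
import numpy as np

def unique_count_pairs(a):
    """Return (values, counts) for any sortable dtype (hypothetical helper)."""
    values, inverse = np.unique(np.asarray(a).ravel(), return_inverse=True)
    # inverse holds, for each element, the index of its value in `values`,
    # so counting those indices gives the frequency of each value.
    counts = np.bincount(inverse.ravel(), minlength=len(values))
    return values, counts

values, counts = unique_count_pairs(["b", "a", "b", "c", "b"])
print(values, counts)

neg_values, neg_counts = unique_count_pairs([-3, -3, 2])
print(neg_values, neg_counts)
```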