我如何有效地获得一个NumPy数组中每个唯一值的频率计数?

>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> freq_count(x)
[(1, 5), (2, 3), (5, 1), (25, 1)]

当前回答

为了计算唯一的非整数——类似于Eelco Hoogendoorn的答案,但速度要快得多(在我的机器上是5倍),我使用了weave。内联组合numpy。只有一点c代码;

import numpy as np
from scipy import weave

def count_unique(datain):
  """
  Similar to numpy.unique function for returning unique members of
  data, but also returns their counts
  """
  data = np.sort(datain)
  uniq = np.unique(data)
  nums = np.zeros(uniq.shape, dtype='int')

  code="""
  int i,count,j;
  j=0;
  count=0;
  for(i=1; i<Ndata[0]; i++){
      count++;
      if(data(i) > data(i-1)){
          nums(j) = count;
          count = 0;
          j++;
      }
  }
  // Handle last value
  nums(j) = count+1;
  """
  weave.inline(code,
      ['data', 'nums'],
      extra_compile_args=['-O2'],
      type_converters=weave.converters.blitz)
  return uniq, nums

配置文件信息

> %timeit count_unique(data)
> 10000 loops, best of 3: 55.1 µs per loop

Eelco的纯numpy版本:

> %timeit unique_count(data)
> 1000 loops, best of 3: 284 µs per loop

Note

这里存在冗余(unique也执行排序),这意味着可以通过将唯一功能放入c-code循环中来进一步优化代码。

其他回答

Most of simple problems get complicated because simple functionality like order() in R that gives a statistical result in both and descending order is missing in various python libraries. But if we devise our thinking that all such statistical ordering and parameters in python are easily found in pandas, we can can result sooner than looking in 100 different places. Also, development of R and pandas go hand-in-hand because they were created for same purpose. To solve this problem I use following code that gets me by anywhere:

unique, counts = np.unique(x, return_counts=True)
d = {'unique':unique, 'counts':count}  # pass the list to a dictionary
df = pd.DataFrame(d) #dictionary object can be easily passed to make a dataframe
df.sort_values(by = 'count', ascending=False, inplace = True)
df = df.reset_index(drop=True) #optional only if you want to use it further

老问题,但我想提供我自己的解决方案,这是最快的,使用普通列表而不是np。数组作为输入(或首先转移到列表),基于我的台架测试。

如果你也遇到这种情况,请检查一下。

def count(a):
    results = {}
    for x in a:
        if x not in results:
            results[x] = 1
        else:
            results[x] += 1
    return results

例如,

>>>timeit count([1,1,1,2,2,2,5,25,1,1]) would return:

100000个循环,最好的3:2.26µs每循环

>>>timeit count(np.array([1,1,1,2,2,2,5,25,1,1]))

100000个回路,最好的3:8.8µs每回路

>>>timeit count(np.array([1,1,1,2,2,2,5,25,1,1]).tolist())

100000个回路,最佳3:5.85µs每回路

而公认的答案会更慢,而scipy.stats.itemfreq解决方案更糟糕。


更深入的测试并没有证实所制定的期望。

from zmq import Stopwatch
aZmqSTOPWATCH = Stopwatch()

aDataSETasARRAY = ( 100 * abs( np.random.randn( 150000 ) ) ).astype( np.int )
aDataSETasLIST  = aDataSETasARRAY.tolist()

import numba
@numba.jit
def numba_bincount( anObject ):
    np.bincount(    anObject )
    return

aZmqSTOPWATCH.start();np.bincount(    aDataSETasARRAY );aZmqSTOPWATCH.stop()
14328L

aZmqSTOPWATCH.start();numba_bincount( aDataSETasARRAY );aZmqSTOPWATCH.stop()
592L

aZmqSTOPWATCH.start();count(          aDataSETasLIST  );aZmqSTOPWATCH.stop()
148609L

参考下面关于影响小型数据集大量重复测试结果的缓存和其他ram内副作用的评论。

尽管这个问题已经得到了回答,但我建议使用一种不同的方法,即numpy.histogram。这样的函数给定一个序列,它返回其元素分组在箱子中的频率。

但是要注意:它在这个例子中是有效的,因为数字是整数。如果它们是实数,那么这个解就不适用了。

>>> from numpy import histogram
>>> y = histogram (x, bins=x.max()-1)
>>> y
(array([5, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1]),
 array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
        23.,  24.,  25.]))
from collections import Counter
x = array( [1,1,1,2,2,2,5,25,1,1] )
mode = counter.most_common(1)[0][0]

像这样的东西应该做到:

#create 100 random numbers
arr = numpy.random.random_integers(0,50,100)

#create a dictionary of the unique values
d = dict([(i,0) for i in numpy.unique(arr)])
for number in arr:
    d[j]+=1   #increment when that value is found

另外,之前的这篇关于有效计算独特元素的文章似乎与您的问题非常相似,除非我遗漏了什么。