找到Python列表中最常见元素的有效方法是什么?
我的列表项可能不是可哈希的,所以不能使用字典。 同样,在抽取的情况下,应返回索引最低的项。例子:
>>> most_common(['duck', 'duck', 'goose'])
'duck'
>>> most_common(['goose', 'duck', 'duck', 'goose'])
'goose'
找到Python列表中最常见元素的有效方法是什么?
我的列表项可能不是可哈希的,所以不能使用字典。 同样,在抽取的情况下,应返回索引最低的项。例子:
>>> most_common(['duck', 'duck', 'goose'])
'duck'
>>> most_common(['goose', 'duck', 'duck', 'goose'])
'goose'
当前回答
#This will return the list sorted by frequency:
def orderByFrequency(list):
listUniqueValues = np.unique(list)
listQty = []
listOrderedByFrequency = []
for i in range(len(listUniqueValues)):
listQty.append(list.count(listUniqueValues[i]))
for i in range(len(listQty)):
index_bigger = np.argmax(listQty)
for j in range(listQty[index_bigger]):
listOrderedByFrequency.append(listUniqueValues[index_bigger])
listQty[index_bigger] = -1
return listOrderedByFrequency
#And this will return a list with the most frequent values in a list:
def getMostFrequentValues(list):
if (len(list) <= 1):
return list
list_most_frequent = []
list_ordered_by_frequency = orderByFrequency(list)
list_most_frequent.append(list_ordered_by_frequency[0])
frequency = list_ordered_by_frequency.count(list_ordered_by_frequency[0])
index = 0
while(index < len(list_ordered_by_frequency)):
index = index + frequency
if(index < len(list_ordered_by_frequency)):
testValue = list_ordered_by_frequency[index]
testValueFrequency = list_ordered_by_frequency.count(testValue)
if (testValueFrequency == frequency):
list_most_frequent.append(testValue)
else:
break
return list_most_frequent
#tests:
print(getMostFrequentValues([]))
print(getMostFrequentValues([1]))
print(getMostFrequentValues([1,1]))
print(getMostFrequentValues([2,1]))
print(getMostFrequentValues([2,2,1]))
print(getMostFrequentValues([1,2,1,2]))
print(getMostFrequentValues([1,2,1,2,2]))
print(getMostFrequentValues([3,2,3,5,6,3,2,2]))
print(getMostFrequentValues([1,2,2,60,50,3,3,50,3,4,50,4,4,60,60]))
Results:
[]
[1]
[1]
[1, 2]
[2]
[1, 2]
[2]
[2, 3]
[3, 4, 50, 60]
其他回答
def most_common(lst):
if max([lst.count(i)for i in lst]) == 1:
return False
else:
return max(set(lst), key=lst.count)
我在最近的一个项目中需要这样做。我承认,我无法理解Alex的回答,所以这就是我最后得到的答案。
def mostPopular(l):
mpEl=None
mpIndex=0
mpCount=0
curEl=None
curCount=0
for i, el in sorted(enumerate(l), key=lambda x: (x[1], x[0]), reverse=True):
curCount=curCount+1 if el==curEl else 1
curEl=el
if curCount>mpCount \
or (curCount==mpCount and i<mpIndex):
mpEl=curEl
mpIndex=i
mpCount=curCount
return mpEl, mpCount, mpIndex
我根据Alex的解决方案计时,对于短列表,它要快10-15%,但一旦超过100个或更多元素(测试多达20万个),它就会慢20%。
提出了这么多解决方案,我很惊讶没有人提出我认为明显的解决方案(对于不可哈希但可比较的元素)——[itertools.groupby][1]。Itertools提供了快速、可重用的功能,并允许您将一些棘手的逻辑委托给经过良好测试的标准库组件。举个例子:
import itertools
import operator
def most_common(L):
# get an iterable of (item, iterable) pairs
SL = sorted((x, i) for i, x in enumerate(L))
# print 'SL:', SL
groups = itertools.groupby(SL, key=operator.itemgetter(0))
# auxiliary function to get "quality" for an item
def _auxfun(g):
item, iterable = g
count = 0
min_index = len(L)
for _, where in iterable:
count += 1
min_index = min(min_index, where)
# print 'item %r, count %r, minind %r' % (item, count, min_index)
return count, -min_index
# pick the highest-count/earliest item
return max(groups, key=_auxfun)[0]
当然,这可以写得更简洁,但我的目标是尽可能清晰。这两个print语句可以取消注释,以便更好地查看运行中的机制;例如,不加注释的打印:
print most_common(['goose', 'duck', 'duck', 'goose'])
发出:
SL: [('duck', 1), ('duck', 2), ('goose', 0), ('goose', 3)]
item 'duck', count 2, minind 1
item 'goose', count 2, minind 0
goose
如您所见,SL是一个对的列表,每对都是一个项目,后面是该项目在原始列表中的索引(为了实现关键条件,如果具有相同最高计数的“最常见”项目是> 1,则结果必须是最早出现的项目)。
Groupby只根据项目分组(通过operator.itemgetter)。辅助函数在max计算期间每分组调用一次,接收并在内部解包一个组——一个包含两个项目(item, iterable)的元组,其中iterable的项目也是两个项目元组,(item,原始索引)[[SL的项目]]。
然后辅助函数使用循环来确定组的可迭代对象中条目的计数和最小原始索引;它将返回这些组合的“质量键”,并更改了最小索引符号,以便Max操作将考虑原始列表中较早出现的那些项。
如果在时间和空间上少考虑一些大o问题,这段代码就会简单得多,例如....:
def most_common(L):
groups = itertools.groupby(sorted(L))
def _auxfun((item, iterable)):
return len(list(iterable)), -L.index(item)
return max(groups, key=_auxfun)[0]
同样的基本思想,只是表达得更简单简洁……但是,唉,额外的O(N)辅助空间(将组的可迭代对象包含到列表中)和O(N平方)时间(获得每个项目的L.index)。虽然过早的优化是编程中的万恶之源,但当O(N log N)方法可用时,却故意选择O(N²)方法,这太违背可伸缩性了!-)
最后,对于那些更喜欢“单行程序”而不是清晰和性能的人来说,一个额外的单行程序版本,它的名字被适当地扭曲了:-)。
from itertools import groupby as g
def most_common_oneliner(L):
return max(g(sorted(L)), key=lambda(x, v):(len(list(v)),-L.index(x)))[0]
简单的一行代码:
def most_common(lst):
return max(set(lst), key=lst.count)
# use Decorate, Sort, Undecorate to solve the problem
def most_common(iterable):
# Make a list with tuples: (item, index)
# The index will be used later to break ties for most common item.
lst = [(x, i) for i, x in enumerate(iterable)]
lst.sort()
# lst_final will also be a list of tuples: (count, index, item)
# Sorting on this list will find us the most common item, and the index
# will break ties so the one listed first wins. Count is negative so
# largest count will have lowest value and sort first.
lst_final = []
# Get an iterator for our new list...
itr = iter(lst)
# ...and pop the first tuple off. Setup current state vars for loop.
count = 1
tup = next(itr)
x_cur, i_cur = tup
# Loop over sorted list of tuples, counting occurrences of item.
for tup in itr:
# Same item again?
if x_cur == tup[0]:
# Yes, same item; increment count
count += 1
else:
# No, new item, so write previous current item to lst_final...
t = (-count, i_cur, x_cur)
lst_final.append(t)
# ...and reset current state vars for loop.
x_cur, i_cur = tup
count = 1
# Write final item after loop ends
t = (-count, i_cur, x_cur)
lst_final.append(t)
lst_final.sort()
answer = lst_final[0][2]
return answer
print most_common(['x', 'e', 'a', 'e', 'a', 'e', 'e']) # prints 'e'
print most_common(['goose', 'duck', 'duck', 'goose']) # prints 'goose'