Generally speaking, is there an efficient way to know how many elements a Python iterator contains, without iterating over each one and counting?
Current answer
No (unless the specific iterator's type implements some special method that makes it possible).
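For instance, some built-in iterators expose an estimate via `__length_hint__`, which `operator.length_hint()` (Python 3.4+) queries without consuming anything; a generator provides no hint, so a supplied default comes back instead:

```python
from operator import length_hint

# A list iterator knows how many items remain.
list_it = iter([10, 20, 30])
print(length_hint(list_it))  # 3 (nothing is consumed)

# A generator cannot know its length; the default is returned.
gen = (x * x for x in range(100))
print(length_hint(gen, -1))  # -1
```

Note that `length_hint` is, as the name says, only a hint; it is exact for sized containers' iterators but cannot help for arbitrary generators.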
In general, the only way to count an iterator's items is to consume it. One of the most efficient approaches:
import itertools
from collections import deque

def count_iter_items(iterable):
    """
    Consume an iterable without reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(zip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)
(On Python 2.x, use itertools.izip instead of zip.)
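As a quick sanity check, here is the Python 3 form of the function (reproduced so the snippet runs on its own), counting a million-item generator without materializing it:

```python
import itertools
from collections import deque

def count_iter_items(iterable):
    """Consume an iterable without reading it into memory; return the item count."""
    counter = itertools.count()
    deque(zip(iterable, counter), maxlen=0)  # exhaust the pairs at C speed
    return next(counter)

print(count_iter_items(x for x in range(10**6)))  # 1000000
```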
Other answers
This is theoretically impossible: in fact, it is the halting problem.
Proof
Suppose, for contradiction, that some function len(g) could determine the length (or infinite length) of any generator g.
For any program P, convert P into a generator g(P): at every return or exit point of P, yield a value instead of returning.
If len(g(P)) == infinity, then P never halts.
This would solve the halting problem, which is known to be impossible (see Wikipedia). Contradiction.
Therefore it is impossible to count the elements of a generic generator without iterating over it (== actually running the whole program).
More concretely, consider
def g():
    while True:
        yield "more?"
Its length is infinite. And there are infinitely many such generators.
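Consequently, the only safe way to inspect such a generator is to bound the consumption yourself, for example with itertools.islice (a minimal sketch):

```python
import itertools

def g():
    while True:
        yield "more?"

# Take only the first five items; iterating g() directly would never finish.
first_five = list(itertools.islice(g(), 5))
print(first_five)  # ['more?', 'more?', 'more?', 'more?', 'more?']
```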
A simple approach is to use the built-in set() or list():
A: set(), in case the iterator contains no duplicate items (the fastest way)
it = zip([1, 2, 3], ['a', 'b', 'c'])
print(len(set(it)))  # set(it) = {(1, 'a'), (2, 'b'), (3, 'c')}
Out[45]: 3
or
it = range(1, 10)
print(len(set(it)))  # set(it) = {1, 2, 3, 4, 5, 6, 7, 8, 9}
Out[47]: 9
B: list(), in case the iterator contains duplicate items
it = (1, 2, 1, 2, 1, 2, 1, 2)
print(len(list(it)))  # list(it) = [1, 2, 1, 2, 1, 2, 1, 2]
Out[49]: 8
# compare with set function
print(len(set(it)))  # set(it) = {1, 2}
Out[51]: 2
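One caveat for both variants: with a genuine one-shot iterator they consume it, so the count is only available once. A minimal illustration:

```python
it = zip([1, 2, 3], ['a', 'b', 'c'])
print(len(list(it)))  # 3
# The zip iterator is now exhausted; a second pass sees nothing.
print(len(list(it)))  # 0
```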
Or simply count manually:
def count_iter(it):
    count = 0
    for _ in it:
        count += 1
    return count
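A slightly faster variant of the same loop keeps the counting in C by pairing enumerate with a one-slot deque; this is the same idiom more_itertools.ilen uses (it shows up in the benchmark further down):

```python
from collections import deque

def ilen(iterable):
    # enumerate counts in C; the deque keeps only the last (index, item) pair.
    d = deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

print(ilen(x for x in range(1000)))  # 1000
```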
As for your original question, the answer is still that, in general, there is no way to know the length of an iterator in Python.
Given that your question is motivated by an application of the pysam library, I can give a more specific answer: I'm a contributor to PySAM and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments by using the location of the file pointer after reading a number of alignments and extrapolating based on the total size of the file. This is enough to implement a progress bar, but not a method of counting alignments in constant time.
So, for anyone who wants a summary of the discussion: the final top scores for counting a 50-million-item generator expression, using
len(list(gen)), len([_ for _ in gen]), sum(1 for _ in gen), ilen(gen) (from more_itertools), and reduce(lambda c, i: c + 1, gen, 0),
sorted by execution performance (including memory consumption), may surprise you:
```
1: test_list.py:8: 0.492 KiB
    gen = (i for i in data*1000); t0 = monotonic(); len(list(gen))
('list, sec', 1.9684218849870376)

2: test_list_compr.py:8: 0.867 KiB
    gen = (i for i in data*1000); t0 = monotonic(); len([i for i in gen])
('list_compr, sec', 2.5885991149989422)

3: test_sum.py:8: 0.859 KiB
    gen = (i for i in data*1000); t0 = monotonic(); sum(1 for i in gen); t1 = monotonic()
('sum, sec', 3.441088170016883)

4: more_itertools/more.py:413: 1.266 KiB
       d = deque(enumerate(iterable, 1), maxlen=1)
   test_ilen.py:10: 0.875 KiB
    gen = (i for i in data*1000); t0 = monotonic(); ilen(gen)
('ilen, sec', 9.812256851990242)

5: test_reduce.py:8: 0.859 KiB
    gen = (i for i in data*1000); t0 = monotonic(); reduce(lambda counter, i: counter + 1, gen, 0)
('reduce, sec', 13.436614598002052)
```
Hence, len(list(gen)) is the most frequently used and, by these measurements, the least memory-consuming option.
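The absolute numbers above depend on the data, Python version, and machine; a quick, hypothetical re-check of the two leading approaches on your own hardware could look like this (the size N and repeat count are arbitrary, and the timings will differ):

```python
from timeit import timeit

N = 10**5  # hypothetical size; adjust to taste

# Time counting a fresh generator expression with each strategy.
t_list = timeit(lambda: len(list(x for x in range(N))), number=20)
t_sum = timeit(lambda: sum(1 for x in range(N)), number=20)
print(f"len(list(gen)): {t_list:.3f}s, sum(1 for _): {t_sum:.3f}s")
```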