In general, is there an efficient way to know how many elements an iterator in Python contains, without iterating through each one and counting?


Current answer

I decided to re-run the benchmark on a modern version of Python and found that the results are almost completely reversed.

I ran the following commands:

py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return len(tuple(x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return len(list(x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return sum(map(lambda i: 1, x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return sum(1 for _ in x)" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  d = deque(enumerate(x, 1), maxlen=1)" -s "  return d[0][0] if d else 0" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  counter = count()" -s "  deque(zip(x, counter), maxlen=0)" -s "  return next(counter)" -- "itlen(it)"

They are equivalent to timing each of the following itlen*(it) functions:

it = iter(range(1000000))
from collections import deque
from itertools import count

def itlen1(x):
  return len(tuple(x))
def itlen2(x):
  return len(list(x))
def itlen3(x):
  return sum(map(lambda i: 1, x))
def itlen4(x):
  return sum(1 for _ in x)
def itlen5(x):
  d = deque(enumerate(x, 1), maxlen=1)
  return d[0][0] if d else 0
def itlen6(x):
  counter = count()
  deque(zip(x, counter), maxlen=0)
  return next(counter)

On a Windows 11 machine running Python 3.11 with an AMD Ryzen 7 5800H and 16 GB of RAM, I got the following output:

10000000 loops, best of 5: 103 nsec per loop
10000000 loops, best of 5: 107 nsec per loop
10000000 loops, best of 5: 138 nsec per loop
10000000 loops, best of 5: 164 nsec per loop
10000000 loops, best of 5: 338 nsec per loop
10000000 loops, best of 5: 425 nsec per loop

This shows that len(list(x)) and len(tuple(x)) are roughly tied, followed by sum(map(lambda i: 1, x)), then closely by sum(1 for _ in x); the other, more complex approaches mentioned in the other answers and/or used in cardinality are at least twice as slow.

Other answers

No (except when your particular iterator's type implements some specific methods that make it possible).
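For example, some built-in iterators expose a size estimate through the __length_hint__ protocol, which operator.length_hint uses when it is available; it is only a hint, and most generators do not provide one:

>>> import operator
>>> operator.length_hint(iter([1, 2, 3, 4]))   # list iterators implement __length_hint__
4
>>> def gen():
...     yield from range(10)
...
>>> operator.length_hint(gen(), -1)            # generators give no hint, so the default is returned
-1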

In general, the only way to count the items of an iterator is to consume it. One of the most efficient ways:

import itertools
from collections import deque

def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(itertools.izip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)

(For Python 3.x, replace itertools.izip with zip.)
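As a quick sanity check (using the Python 3 form with zip), it consumes the iterator and returns the item count:

>>> count_iter_items(iter(range(1000)))
1000
>>> count_iter_items(c for c in "abc")
3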

As for your original question, the answer is still that, in general, there is no way to know the length of an iterator in Python.

Given that your question is motivated by an application of the pysam library, I can give a more specific answer: I'm a contributor to PySAM, and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments by using the location of the file pointer after reading a number of alignments and extrapolating based on the total size of the file. This is enough to implement a progress bar, but not a method of counting alignments in constant time.
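A rough sketch of that extrapolation idea: it assumes pysam's AlignmentFile and its tell() method, and it treats the upper bits of the BGZF virtual offset returned by tell() as the compressed byte offset (an assumption of this sketch, not a documented counting API). The result is only an estimate, good enough for a progress bar:

import os
import pysam

def estimate_alignment_count(path, sample_size=100_000):
    """Roughly estimate the number of alignments in a BAM file.

    Reads up to `sample_size` alignments, checks how far the file pointer
    has advanced, and extrapolates using the total (compressed) file size.
    Compression ratios vary across the file, so this is approximate.
    """
    total_bytes = os.path.getsize(path)
    with pysam.AlignmentFile(path, "rb") as bam:
        n = 0
        for _ in bam:
            n += 1
            if n >= sample_size:
                break
        # tell() returns a BGZF virtual offset; the high bits are assumed
        # to hold the compressed byte offset of the current block.
        compressed_offset = bam.tell() >> 16
    if compressed_offset == 0:
        # Still inside the first compressed block: too little data to
        # extrapolate from, so just return what was actually counted.
        return n
    return int(n * total_bytes / compressed_offset)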

I like the cardinality package for this; it is very lightweight and tries to use the fastest available implementation depending on the iterable.

Usage:

>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2

The actual implementation of count() looks like this:

import collections

def count(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0
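The deque trick works because enumerate numbers the items starting from 1, and a deque with maxlen=1 drains the iterable at C speed while keeping only the last (index, item) pair; a quick check:

>>> import collections
>>> collections.deque(enumerate(iter("abcd"), 1), maxlen=1)
deque([(4, 'd')], maxlen=1)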

Suppose you want to count the number of items without consuming the iterator, so that it is not exhausted and you can use it again later. This is possible with copy or deepcopy:

import copy

def get_iter_len(iterator):
    return sum(1 for _ in copy.copy(iterator))

###############################################

iterator = range(0, 10)
print(get_iter_len(iterator))

if len(tuple(iterator)) > 1:
    print("Finding the length did not exhaust the iterator!")
else:
    print("oh no! it's all gone")

The output is "Finding the length did not exhaust the iterator!"

Optionally (and inadvisably), you can shadow the built-in len function as follows:

import copy

def len(obj, *, len=len):
    # The keyword-only default keeps a reference to the real built-in len.
    if hasattr(obj, "__len__"):
        return len(obj)
    if hasattr(obj, "__next__"):
        # Count a copy of the iterator so the original is not consumed.
        return sum(1 for _ in copy.copy(obj))
    return len(obj)

An iterator is just an object holding a pointer to the next object to be read from some kind of buffer or stream. It is like a LinkedList, where you don't know how many items you have until you traverse them. Iterators are efficient because all they do is tell you what comes next by reference, rather than by using indexing (but, as you can see, you lose the ability to see how many entries remain).
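A minimal sketch of that picture, using a hypothetical linked-list iterator (Node and LinkedListIterator are made up for illustration): the iterator only knows the current node, so the only way to find out how many items there are is to walk the whole chain.

class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node

class LinkedListIterator:
    """Holds only a reference to the current node; the remaining length is unknown."""

    def __init__(self, head):
        self._current = head

    def __iter__(self):
        return self

    def __next__(self):
        if self._current is None:
            raise StopIteration
        value = self._current.value
        self._current = self._current.next_node
        return value

# The only way to learn the length is to traverse (and thereby consume) it:
head = Node(1, Node(2, Node(3)))
print(sum(1 for _ in LinkedListIterator(head)))  # 3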