一般来说,有没有一种有效的方法可以知道Python中的迭代器中有多少个元素,而不用遍历每个元素并计数?
当前回答
在计算机上有两种方法来获取“某物”的长度。
第一种方法是存储一个计数——这需要任何接触文件/数据的东西来修改它(或者一个只公开接口的类——但归根结底是一样的)。
另一种方法是遍历它并计算它有多大。
其他回答
关于你最初的问题,答案仍然是,在Python中通常没有办法知道迭代器的长度。
Given that you question is motivated by an application of the pysam library, I can give a more specific answer: I'm a contributer to PySAM and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments by using the location of the file pointer after reading a number of alignments and extrapolating based on the total size of the file. This is enough to implement a progress bar, but not a method of counting alignments in constant time.
通常的做法是将这类信息放在文件头中,并让pysam允许您访问这些信息。我不知道格式,但是你检查过API了吗?
正如其他人所说,你不能从迭代器中知道长度。
这段代码应该工作:
>>> iter = (i for i in range(50))
>>> sum(1 for _ in iter)
50
尽管它确实遍历每一项并计算它们,但这是最快的方法。
它也适用于迭代器中没有项的情况:
>>> sum(1 for _ in range(0))
0
当然,对于一个无限的输入,它会一直运行,所以请记住迭代器可以是无限的:
>>> sum(1 for _ in itertools.count())
[nothing happens, forever]
此外,请注意,这样做将耗尽迭代器,并且进一步尝试使用它将看不到任何元素。这是Python迭代器设计的一个不可避免的结果。如果你想保留元素,你就必须把它们存储在一个列表或其他东西中。
我决定在现代版本的Python上重新运行基准测试,并发现几乎完全颠倒了基准测试
我运行了以下命令:
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s " return len(tuple(x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s " return len(list(x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s " return sum(map(lambda i: 1, x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s " return sum(1 for _ in x)" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s " d = deque(enumerate(x, 1), maxlen=1)" -s " return d[0][0] if d else 0" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s " counter = count()" -s " deque(zip(x, counter), maxlen=0)" -s " return next(counter)" -- "itlen(it)"
它们等价于为以下每个itlen*(it)函数计时:
it = iter(range(1000000))
from collections import deque
from itertools import count
def itlen1(x):
return len(tuple(x))
def itlen2(x):
return len(list(x))
def itlen3(x):
return sum(map(lambda i: 1, x))
def itlen4(x):
return sum(1 for _ in x)
def itlen5(x):
d = deque(enumerate(x, 1), maxlen=1)
return d[0][0] if d else 0
def itlen6(x):
counter = count()
deque(zip(x, counter), maxlen=0)
return next(counter)
在装有AMD Ryzen 7 5800H和16 GB RAM的Windows 11、Python 3.11机器上,我得到了以下输出:
10000000 loops, best of 5: 103 nsec per loop
10000000 loops, best of 5: 107 nsec per loop
10000000 loops, best of 5: 138 nsec per loop
10000000 loops, best of 5: 164 nsec per loop
10000000 loops, best of 5: 338 nsec per loop
10000000 loops, best of 5: 425 nsec per loop
这表明len(list(x))和len(tuple(x))是绑定的;后面跟着sum(map(lambda i: 1, x));然后紧靠sum(1 for _ in x);那么其他答案中提到的其他更复杂的方法和/或在基数中使用的方法至少要慢两倍。
虽然一般情况下不可能按照要求去做,但在迭代了多少项之后,对它们进行迭代的次数进行计数通常仍然是有用的。为此,您可以使用jaraco.itertools.Counter或类似的方法。下面是一个使用python3和rwt加载包的例子。
$ rwt -q jaraco.itertools -- -q
>>> import jaraco.itertools
>>> items = jaraco.itertools.Counter(range(100))
>>> _ = list(counted)
>>> items.count
100
>>> import random
>>> def gen(n):
... for i in range(n):
... if random.randint(0, 1) == 0:
... yield i
...
>>> items = jaraco.itertools.Counter(gen(100))
>>> _ = list(counted)
>>> items.count
48