How do I get the line count of a large file in the most memory- and time-efficient way?

def file_len(filename):
    i = -1  # so an empty file returns 0 instead of raising UnboundLocalError
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

Current answer

def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
            pass
    return count

Other answers

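This is the fastest thing I have found using pure Python. You can use as much or as little memory as you like by setting buffer, though 2**16 appears to be a sweet spot on my computer.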

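from functools import partial

buffer = 2**16  # number of characters to read per chunk
with open(myfile) as f:
    # iter() keeps calling f.read(buffer) until it returns '' at end of file
    print(sum(x.count('\n') for x in iter(partial(f.read, buffer), '')))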

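I found the answer in "Why is reading lines from stdin much slower in C++ than Python?" and tweaked it slightly. It is a very good read for understanding how to count lines quickly, although wc -l is still about 75% faster than anything else.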

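There are lots of answers already, but unfortunately most of them are just tiny economies on a barely optimizable problem...

I have worked on several projects where line counting was the core function of the software, and processing a huge number of files as fast as possible was of paramount importance.

The main bottleneck in line counting is I/O access: you have to read every line to detect the line return character, and there is simply no way around that. The second potential bottleneck is memory management: the more you load at once, the faster you can process, but this bottleneck is negligible compared to the first.

Hence, apart from tiny optimizations such as disabling gc collection and other micro-managing tricks, there are 3 major ways to reduce the processing time of a line count function: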

1. Hardware solution: the major and most obvious way is non-programmatic: buy a very fast SSD/flash drive. This is by far how you get the biggest speed boost.

2. Data preparation solution: if you generate the files you process, can modify how they are generated, or can afford to pre-process them, first convert the line endings to Unix style (\n), which saves one character per line compared to Windows style (\r\n). It is not a big saving, but it is an easy gain. Second, and most importantly, you can potentially write lines of fixed length. If you need variable length, you can always pad the shorter lines. This way, you can calculate the number of lines instantly from the total file size, which is much faster to access. Often, the best solution to a problem is to pre-process it so that it better fits your end purpose.

3. Parallelization + hardware solution: if you can buy multiple disks (SSDs if possible), you can go beyond the speed of a single disk by storing your files in a balanced way across the disks (balancing by total size is easiest) and then reading from all of them in parallel. You can then expect a speedup roughly in proportion to the number of disks you have. If buying multiple disks is not an option, parallelization likely won't help. The exception is a disk with multiple read heads, like some professional-grade disks, but even then the disk's internal cache and controller circuitry will likely be a bottleneck and prevent you from using all heads fully in parallel. You would also have to write code specific to that drive, because you need to know the exact cluster mapping in order to store your files on clusters under different heads and read them back with different heads afterwards. Indeed, it is commonly known that sequential reading is almost always faster than random reading, and parallel reads on a single disk perform more like random reads than sequential ones (you can test your drive's speed in both respects with CrystalDiskMark, for example).
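The fixed-length idea from the data preparation point can be sketched roughly as follows. This is only an illustration under assumptions, not code from the answer: the names write_padded and count_fixed_width_lines are made up for the example, lines are padded with spaces, and the record width (newline included) is whatever you choose when generating the file.

```python
import os

def write_padded(path, lines, width):
    """Write each line padded with spaces to exactly `width` bytes, newline included."""
    with open(path, 'wb') as f:
        for line in lines:
            data = line.encode('utf-8')
            if len(data) > width - 1:
                raise ValueError('line longer than the fixed record width')
            # ljust pads with ASCII spaces up to width - 1, then the newline is added
            f.write(data.ljust(width - 1) + b'\n')

def count_fixed_width_lines(path, width):
    """Derive the line count from the file size alone, without reading the content."""
    size = os.path.getsize(path)
    if size % width:
        raise ValueError('file size is not a multiple of the record width')
    return size // width
```

With such a file, the count costs one metadata lookup regardless of file size. The price is that every reader has to strip the padding, and the file is larger than it would be with variable-length lines.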

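If none of those is an option, then you can only rely on micro-managing tricks to squeeze a little more speed out of your line counting function, but don't expect anything really significant. Rather, expect the time you spend tweaking to be out of proportion to the speed improvement you get in return.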

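You can't get any better than that.

After all, any solution has to read the entire file, figure out how many \n it contains, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound. The best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered.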
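One detail worth spelling out: counting \n characters only equals the number of lines if the file ends with a newline. A minimal sketch of a chunked counter that also handles a final line without a trailing newline might look like this (the name count_lines_binary and the default chunk size are assumptions for the example, and it assumes \n or \r\n line endings):

```python
def count_lines_binary(path, chunk_size=1 << 16):
    """Count lines by counting newline bytes in fixed-size binary chunks."""
    count = 0
    last_byte = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
            last_byte = chunk[-1:]  # remember how the data read so far ends
    # A non-empty file whose last line lacks a trailing newline still has that line.
    if last_byte and last_byte != b'\n':
        count += 1
    return count
```

Memory use stays bounded by chunk_size no matter how large the file is, and the result matches what iterating over the file line by line would give for files with those line endings.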

def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count

I got a small (4-8%) improvement with this version, which reuses a constant buffer, so it should avoid any memory or GC overhead:

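lines = 0
buffer = bytearray(2048)
with open(filename, 'rb') as f:  # binary mode: text-mode files have no readinto()
    while True:
        n = f.readinto(buffer)  # number of bytes actually read, 0 at end of file
        if not n:
            break
        # count only the bytes just read; the rest of the buffer holds stale data
        lines += buffer.count(b'\n', 0, n)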
You can play around with the buffer size and may see a little improvement.
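If you want to try different buffer sizes, a rough timing harness could look like the sketch below. The function names count_with_buffer and time_buffer_sizes and the candidate sizes are assumptions picked for illustration, not measured recommendations.

```python
import time

def count_with_buffer(path, buf_size):
    """Count newline bytes using one reusable buffer of buf_size bytes."""
    lines = 0
    buffer = bytearray(buf_size)
    # buffering=0 gives a raw file object, so readinto fills our buffer directly
    with open(path, 'rb', buffering=0) as f:
        while True:
            n = f.readinto(buffer)
            if not n:
                break
            lines += buffer.count(b'\n', 0, n)
    return lines

def time_buffer_sizes(path, sizes=(2**11, 2**14, 2**16, 2**20)):
    """Return {buf_size: (line_count, seconds)} for each candidate size."""
    results = {}
    for size in sizes:
        start = time.perf_counter()
        count = count_with_buffer(path, size)
        results[size] = (count, time.perf_counter() - start)
    return results
```

Keep in mind that the operating system caches file contents after the first read, so later runs on the same file tend to look faster than the first. Repeat the measurement several times before drawing conclusions.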