如何以最有效的内存和时间方式获取大文件的行数?

def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

当前回答

我得到了一个小(4-8%)的改进,这个版本重用了一个常量缓冲区,所以它应该避免任何内存或GC开销:

lines = 0
buffer = bytearray(2048)
with open(filename) as f:
  while f.readinto(buffer) > 0:
      lines += buffer.count('\n')

您可以调整缓冲区大小,可能会看到一些改进。

其他回答

与此答案类似的一行bash解决方案,使用了现代子进程。check_output功能:

def line_count(filename):
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])

这个怎么样?

import fileinput
import sys

counter=0
for line in fileinput.input([sys.argv[1]]):
    counter+=1

fileinput.close()
print counter
def file_len(full_path):
  """ Count number of lines in a file."""
  f = open(full_path)
  nr_of_lines = sum(1 for line in f)
  f.close()
  return nr_of_lines

对我来说,这个变体是最快的:

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()

原因:缓冲比逐行和逐字符串读取快。计数也非常快

一句话解决方案:

import os
os.system("wc -l  filename")  

我的代码片段:

>>> os.system('wc -l *.txt')

0 bar.txt
1000 command.txt
3 test_file.txt
1003 total