如何以最有效的内存和时间方式获取大文件的行数?

def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

当前回答

def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count

其他回答

这是我用的,看起来很干净:

import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])

更新:这比使用纯python略快,但以内存使用为代价。子进程在执行您的命令时将派生一个与父进程具有相同内存占用的新进程。

如果文件能放进内存,那么

with open(fname) as f:
    count = len(f.read().split(b'\n')) - 1

对我来说,这个变体是最快的:

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()

原因:缓冲比逐行和逐字符串读取快。计数也非常快

一行,可能很快:

num_lines = sum(1 for line in open('myfile.txt'))

如果你的文件中的所有行都是相同的长度(并且只包含ASCII字符)*,你可以非常便宜地执行以下操作:

fileSize     = os.path.getsize( pathToFile )  # file size in bytes
bytesPerLine = someInteger                    # don't forget to account for the newline character
numLines     = fileSize // bytesPerLine

*如果使用像é这样的unicode字符,我怀疑需要更多的努力来确定一行中的字节数。