如何以最有效的内存和时间方式获取大文件的行数?

def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

当前回答

def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
            pass
    return count

其他回答

这个呢

def file_len(fname):
  counts = itertools.count()
  with open(fname) as f: 
    for _ in f: counts.next()
  return counts.next()

对我来说,这个变体是最快的:

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()

原因:缓冲比逐行和逐字符串读取快。计数也非常快

为了完成上述方法,我尝试了fileinput模块的一个变体:

import fileinput as fi   
def filecount(fname):
        for line in fi.input(fname):
            pass
        return fi.lineno()

并将一个60mil行文件传递给上述所有方法:

mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974

这让我有点惊讶,fileinput是如此糟糕,比所有其他方法都要糟糕得多…

这是我用的,看起来很干净:

import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])

更新:这比使用纯python略快,但以内存使用为代价。子进程在执行您的命令时将派生一个与父进程具有相同内存占用的新进程。

下面这句话怎么样:

file_length = len(open('myfile.txt','r').read().split('\n'))

用这种方法在一个3900行的文件上计时只需要0.003秒

def c():
  import time
  s = time.time()
  file_length = len(open('myfile.txt','r').read().split('\n'))
  print time.time() - s