如何在Python中廉价地获得一个大文件的行数?

如何以最有效的内存和时间方式获取大文件的行数?

def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

当前回答

打开一个文件的结果是一个迭代器，它可以转换为一个序列，它有一个长度:

with open(filename) as f:
   return len(list(f))

这比显式循环更简洁，并避免了枚举。

2009-05-10 11:35:26

其他回答

没有比这更好的了。

毕竟，任何解决方案都必须读取整个文件，计算出有多少\n，并返回结果。

在不读取整个文件的情况下，你有更好的方法吗?不确定……最好的解决方案总是I/ o受限，你能做的最好的就是确保不使用不必要的内存，但看起来你已经覆盖了这个问题。

2009-05-10 10:37:42

这是对其他一些答案的元评论。

The line-reading and buffered \n-counting techniques won't return the same answer for every file, because some text files have no newline at the end of the last line. You can work around this by checking the last byte of the last nonempty buffer and adding 1 if it's not b'\n'. In Python 3, opening the file in text mode and in binary mode can yield different results, because text mode by default recognizes CR, LF, and CRLF as line endings (converting them all to '\n'), while in binary mode only LF and CRLF will be counted if you count b'\n'. This applies whether you read by lines or into a fixed-size buffer. The classic Mac OS used CR as a line ending; I don't know how common those files are these days. The buffer-reading approach uses a bounded amount of RAM independent of file size, while the line-reading approach could read the entire file into RAM at once in the worst case (especially if the file uses CR line endings). In the worst case it may use substantially more RAM than the file size, because of overhead from dynamic resizing of the line buffer and (if you opened in text mode) Unicode decoding and storage. You can improve the memory usage, and probably the speed, of the buffered approach by pre-allocating a bytearray and using readinto instead of read. One of the existing answers (with few votes) does this, but it's buggy (it double-counts some bytes). The top buffer-reading answer uses a large buffer (1 MiB). Using a smaller buffer can actually be faster because of OS readahead. If you read 32K or 64K at a time, the OS will probably start reading the next 32K/64K into the cache before you ask for it, and each trip to the kernel will return almost immediately. If you read 1 MiB at a time, the OS is unlikely to speculatively read a whole megabyte. It may preread a smaller amount but you will still spend a significant amount of time sitting in the kernel waiting for the disk to return the rest of the data.

2020-08-14 17:52:10

对我来说，这个变体是最快的:

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()

原因:缓冲比逐行和逐字符串读取快。计数也非常快

2009-05-10 11:29:12

这是我用的，看起来很干净:

import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])

更新:这比使用纯python略快，但以内存使用为代价。子进程在执行您的命令时将派生一个与父进程具有相同内存占用的新进程。

2017-07-26 18:13:47

我得到了一个小(4-8%)的改进，这个版本重用了一个常量缓冲区，所以它应该避免任何内存或GC开销:

lines = 0
buffer = bytearray(2048)
with open(filename) as f:
  while f.readinto(buffer) > 0:
      lines += buffer.count('\n')

您可以调整缓冲区大小，可能会看到一些改进。

2013-02-25 19:31:53

如何在Python中廉价地获得一个大文件的行数?

推荐文章

最新文章

标签