如何以最有效的内存和时间方式获取大文件的行数?
def file_len(filename):
with open(filename) as f:
for i, _ in enumerate(f):
pass
return i + 1
如何以最有效的内存和时间方式获取大文件的行数?
def file_len(filename):
with open(filename) as f:
for i, _ in enumerate(f):
pass
return i + 1
当前回答
对我来说,这个变体是最快的:
#!/usr/bin/env python
def main():
f = open('filename')
lines = 0
buf_size = 1024 * 1024
read_f = f.read # loop optimization
buf = read_f(buf_size)
while buf:
lines += buf.count('\n')
buf = read_f(buf_size)
print lines
if __name__ == '__main__':
main()
原因:缓冲比逐行和逐字符串读取快。计数也非常快
其他回答
您可以执行子进程并运行wc -l filename
import subprocess
def file_len(fname):
p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
result, err = p.communicate()
if p.returncode != 0:
raise IOError(err)
return int(result.strip().split()[0])
我使用的最简单和最短的方法是:
f = open("my_file.txt", "r")
len(f.readlines())
这是对其他一些答案的元评论。
The line-reading and buffered \n-counting techniques won't return the same answer for every file, because some text files have no newline at the end of the last line. You can work around this by checking the last byte of the last nonempty buffer and adding 1 if it's not b'\n'. In Python 3, opening the file in text mode and in binary mode can yield different results, because text mode by default recognizes CR, LF, and CRLF as line endings (converting them all to '\n'), while in binary mode only LF and CRLF will be counted if you count b'\n'. This applies whether you read by lines or into a fixed-size buffer. The classic Mac OS used CR as a line ending; I don't know how common those files are these days. The buffer-reading approach uses a bounded amount of RAM independent of file size, while the line-reading approach could read the entire file into RAM at once in the worst case (especially if the file uses CR line endings). In the worst case it may use substantially more RAM than the file size, because of overhead from dynamic resizing of the line buffer and (if you opened in text mode) Unicode decoding and storage. You can improve the memory usage, and probably the speed, of the buffered approach by pre-allocating a bytearray and using readinto instead of read. One of the existing answers (with few votes) does this, but it's buggy (it double-counts some bytes). The top buffer-reading answer uses a large buffer (1 MiB). Using a smaller buffer can actually be faster because of OS readahead. If you read 32K or 64K at a time, the OS will probably start reading the next 32K/64K into the cache before you ask for it, and each trip to the kernel will return almost immediately. If you read 1 MiB at a time, the OS is unlikely to speculatively read a whole megabyte. It may preread a smaller amount but you will still spend a significant amount of time sitting in the kernel waiting for the disk to return the rest of the data.
我相信内存映射文件将是最快的解决方案。我尝试了四个函数:由OP发布的函数(opcount);对文件中的行进行简单迭代(simplecount);带有内存映射字段(mmap)的Readline (mapcount);以及Mykola Kharechko (buffcount)提供的缓冲区读取解决方案。
我将每个函数运行五次,并计算出120万在线文本文件的平均运行时间。
Windows XP, Python 2.5, 2GB RAM, 2ghz AMD处理器
以下是我的结果:
mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714
编辑:Python 2.6的数字:
mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297
因此,对于Windows/Python 2.6,缓冲区读取策略似乎是最快的
代码如下:
from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict
def mapcount(filename):
f = open(filename, "r+")
buf = mmap.mmap(f.fileno(), 0)
lines = 0
readline = buf.readline
while readline():
lines += 1
return lines
def simplecount(filename):
lines = 0
for line in open(filename):
lines += 1
return lines
def bufcount(filename):
f = open(filename)
lines = 0
buf_size = 1024 * 1024
read_f = f.read # loop optimization
buf = read_f(buf_size)
while buf:
lines += buf.count('\n')
buf = read_f(buf_size)
return lines
def opcount(fname):
with open(fname) as f:
for i, l in enumerate(f):
pass
return i + 1
counts = defaultdict(list)
for i in range(5):
for func in [mapcount, simplecount, bufcount, opcount]:
start_time = time.time()
assert func("big_file.txt") == 1209138
counts[func].append(time.time() - start_time)
for key, vals in counts.items():
print key.__name__, ":", sum(vals) / float(len(vals))
在perfplot分析之后,必须推荐缓冲读取解决方案
def buf_count_newlines_gen(fname):
def _make_gen(reader):
while True:
b = reader(2 ** 16)
if not b: break
yield b
with open(fname, "rb") as f:
count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
return count
它速度快,内存效率高。大多数其他解决方案大约要慢20倍。
代码重现情节:
import mmap
import subprocess
from functools import partial
import perfplot
def setup(n):
fname = "t.txt"
with open(fname, "w") as f:
for i in range(n):
f.write(str(i) + "\n")
return fname
def for_enumerate(fname):
i = 0
with open(fname) as f:
for i, _ in enumerate(f):
pass
return i + 1
def sum1(fname):
return sum(1 for _ in open(fname))
def mmap_count(fname):
with open(fname, "r+") as f:
buf = mmap.mmap(f.fileno(), 0)
lines = 0
while buf.readline():
lines += 1
return lines
def for_open(fname):
lines = 0
for _ in open(fname):
lines += 1
return lines
def buf_count_newlines(fname):
lines = 0
buf_size = 2 ** 16
with open(fname) as f:
buf = f.read(buf_size)
while buf:
lines += buf.count("\n")
buf = f.read(buf_size)
return lines
def buf_count_newlines_gen(fname):
def _make_gen(reader):
b = reader(2 ** 16)
while b:
yield b
b = reader(2 ** 16)
with open(fname, "rb") as f:
count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
return count
def wc_l(fname):
return int(subprocess.check_output(["wc", "-l", fname]).split()[0])
def sum_partial(fname):
with open(fname) as f:
count = sum(x.count("\n") for x in iter(partial(f.read, 2 ** 16), ""))
return count
def read_count(fname):
return open(fname).read().count("\n")
b = perfplot.bench(
setup=setup,
kernels=[
for_enumerate,
sum1,
mmap_count,
for_open,
wc_l,
buf_count_newlines,
buf_count_newlines_gen,
sum_partial,
read_count,
],
n_range=[2 ** k for k in range(27)],
xlabel="num lines",
)
b.save("out.png")
b.show()