How can I get the line count of a large file in the most memory- and time-efficient way?
def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1
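Current answer
You can call wc -l through the subprocess module, as follows:
import subprocess
# Assumes a Unix-like system where the wc command is available.
# Passing the arguments as a list avoids shell quoting problems with unusual file names.
Number_lines = int(subprocess.check_output(['wc', '-l', Filename]).split()[0])
where Filename is the absolute path of the file.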
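Other answers
One-line solution (wc prints the count to standard output; os.system itself only returns the exit status):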
import os
os.system("wc -l filename")
My snippet:
>>> os.system('wc -l *.txt')
0 bar.txt
1000 command.txt
3 test_file.txt
1003 total
This is a meta-comment on some of the other answers.
The line-reading and buffered \n-counting techniques won't return the same answer for every file, because some text files have no newline at the end of the last line. You can work around this by checking the last byte of the last nonempty buffer and adding 1 if it's not b'\n'.
In Python 3, opening the file in text mode and in binary mode can yield different results, because text mode by default recognizes CR, LF, and CRLF as line endings (converting them all to '\n'), while in binary mode only LF and CRLF will be counted if you count b'\n'. This applies whether you read by lines or into a fixed-size buffer. The classic Mac OS used CR as a line ending; I don't know how common those files are these days.
The buffer-reading approach uses a bounded amount of RAM independent of file size, while the line-reading approach could read the entire file into RAM at once in the worst case (especially if the file uses CR line endings). In the worst case it may use substantially more RAM than the file size, because of overhead from dynamic resizing of the line buffer and (if you opened in text mode) Unicode decoding and storage.
You can improve the memory usage, and probably the speed, of the buffered approach by pre-allocating a bytearray and using readinto instead of read. One of the existing answers (with few votes) does this, but it's buggy (it double-counts some bytes).
The top buffer-reading answer uses a large buffer (1 MiB). Using a smaller buffer can actually be faster because of OS readahead. If you read 32K or 64K at a time, the OS will probably start reading the next 32K/64K into the cache before you ask for it, and each trip to the kernel will return almost immediately. If you read 1 MiB at a time, the OS is unlikely to speculatively read a whole megabyte. It may preread a smaller amount, but you will still spend a significant amount of time sitting in the kernel waiting for the disk to return the rest of the data.
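The points above about readinto, a pre-allocated bytearray, and the missing final newline can be combined in a small sketch. This is only one possible reading of that advice, not the code from any of the answers: the function name count_lines_buffered and the 64 KiB default are assumptions made for illustration, and the function counts b'\n' bytes only, so CR-only files would be reported as a single line.

```python
def count_lines_buffered(path, buf_size=64 * 1024):
    """Count lines by scanning fixed-size chunks for b'\\n'.

    Hypothetical helper: the name and the default buffer size are
    illustrative choices, not taken from the answers above.
    """
    buf = bytearray(buf_size)   # allocated once and reused for every read
    count = 0
    last_byte = None            # last byte of the last nonempty chunk
    # buffering=0 gives a raw file object, so readinto fills buf directly.
    with open(path, 'rb', buffering=0) as f:
        while True:
            n = f.readinto(buf)
            if not n:           # 0 bytes read means end of file
                break
            # Only the first n bytes are valid; restricting the search range
            # avoids counting stale bytes left over from a previous read.
            count += buf.count(b'\n', 0, n)
            last_byte = buf[n - 1]
    # A nonempty file whose last byte is not a newline still has a final line.
    if last_byte is not None and last_byte != 0x0A:
        count += 1
    return count
```

Limiting the search to the first n bytes is what prevents the double-counting problem mentioned above when the final read fills only part of the buffer.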
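How about this one-liner:
file_length = len(open('myfile.txt','r').read().split('\n'))
Timing this method on a 3900-line file takes only 0.003 seconds:
def c():
    import time
    s = time.time()
    # Reads the whole file into memory; a trailing newline yields one extra empty element.
    file_length = len(open('myfile.txt','r').read().split('\n'))
    print(time.time() - s)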
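One detail worth checking with the split('\n') approach is how it treats a file that ends with a newline. The sketch below is illustrative only: count_by_split and count_by_newlines are hypothetical helper names, and the second function is just one way to adjust for the trailing newline.

```python
def count_by_split(path):
    # Same idea as the one-liner above, but the file is closed afterwards.
    with open(path, 'r') as f:
        return len(f.read().split('\n'))


def count_by_newlines(path):
    # Count '\n' characters and add one only if the last line is unterminated.
    with open(path, 'r') as f:
        text = f.read()
    n = text.count('\n')
    if text and not text.endswith('\n'):
        n += 1
    return n


if __name__ == '__main__':
    import os
    import tempfile

    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, 'w') as fh:
        fh.write('a\nb\nc\n')           # three lines, newline-terminated
    print(count_by_split(path))         # counts the empty string after the last '\n'
    print(count_by_newlines(path))
    os.remove(path)
```

For a newline-terminated file the split-based count is one higher than the number of lines, because the text after the final '\n' is an empty string. Both variants still read the entire file into memory.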
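Using Numba
We can use Numba to JIT (just-in-time) compile our function to machine code. def numbacountparallel(fname) runs 2.8x faster than def file_len(fname) from the question.
Notes:
The OS had already cached the file in memory before the benchmarks were run, as I didn't see much disk activity on my PC. Reading the file for the first time would be much slower, which would make the time advantage of using Numba insignificant.
JIT compilation takes extra time the first time the function is called.
This would be useful if we were doing more than just counting lines.
Cython is another option.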
http://numba.pydata.org/
Conclusion
Because counting lines is IO-bound, use def file_len(fname) from the question, unless you want to do more than just count lines.
import timeit
from numba import jit, prange
import numpy as np
from itertools import (takewhile,repeat)
FILE = '../data/us_confirmed.csv' # 40.6MB, 371755 line file
CR = ord('\n')
# Copied from the question above. Used as a benchmark
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
# Copied from another answer. Used as a benchmark
def rawincount(filename):
    # Read 10 MiB chunks until read() returns an empty bytes object.
    with open(filename, 'rb') as f:
        bufgen = takewhile(lambda x: x, (f.read(1024*1024*10) for _ in repeat(None)))
        return sum( buf.count(b'\n') for buf in bufgen )
# Single thread
@jit(nopython=True)
def numbacountsingle_chunk(bs):
    c = 0
    for i in range(len(bs)):
        if bs[i] == CR:  # CR holds ord('\n'), i.e. the line feed byte
            c += 1
    return c
def numbacountsingle(filename):
    total = 0
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(1024*1024*10)
            if not chunk:  # empty read means end of file
                break
            total += numbacountsingle_chunk(chunk)
    return total
# Multi thread
@jit(nopython=True, parallel=True)
def numbacountparallel_chunk(bs):
    c = 0
    for i in prange(len(bs)):
        if bs[i] == CR:  # CR holds ord('\n'), i.e. the line feed byte
            c += 1
    return c
def numbacountparallel(filename):
    total = 0
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(1024*1024*10)
            if not chunk:  # empty read means end of file
                break
            # View the bytes as a uint8 array without copying.
            total += numbacountparallel_chunk(np.frombuffer(chunk, dtype=np.uint8))
    return total
print('numbacountparallel')
print(numbacountparallel(FILE)) # This allows Numba to compile and cache the function without adding to the time.
print(timeit.Timer(lambda: numbacountparallel(FILE)).timeit(number=100))
print('\nnumbacountsingle')
print(numbacountsingle(FILE))
print(timeit.Timer(lambda: numbacountsingle(FILE)).timeit(number=100))
print('\nfile_len')
print(file_len(FILE))
print(timeit.Timer(lambda: file_len(FILE)).timeit(number=100))
print('\nrawincount')
print(rawincount(FILE))
print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))
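Time in seconds for 100 calls of each function. In the run that produced these numbers, the file_len timing line called rawincount, so the file_len figure is effectively a second rawincount measurement.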
numbacountparallel
371755
2.8007332000000003
numbacountsingle
371755
3.1508585999999994
file_len
371755
6.7945494
rawincount
371755
6.815438
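To make the listed totals easier to compare, they can be converted to per-call times and to ratios against the fastest entry. The sketch below only does arithmetic on the four numbers printed above, taken at face value; the helper names per_call_ms and relative_to are made up for this example.

```python
# Totals in seconds for 100 calls, copied from the output above.
timings = {
    "numbacountparallel": 2.8007332000000003,
    "numbacountsingle": 3.1508585999999994,
    "file_len": 6.7945494,
    "rawincount": 6.815438,
}
CALLS = 100


def per_call_ms(total_seconds, calls=CALLS):
    # Average time of a single call, in milliseconds.
    return total_seconds / calls * 1000.0


def relative_to(baseline, table):
    # Ratio of each total to the baseline total (1.0 for the baseline itself).
    base = table[baseline]
    return {name: t / base for name, t in table.items()}


ratios = relative_to("numbacountparallel", timings)
for name, total in timings.items():
    print(f"{name}: {per_call_ms(total):.1f} ms per call, "
          f"{ratios[name]:.2f}x the numbacountparallel time")
```

On these particular numbers the file_len and rawincount entries come out at roughly 2.4 times the numbacountparallel total, and the single-threaded Numba version at roughly 1.1 times.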
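I would use Python's file object method readlines, as follows:
with open(input_file) as foo:
    lines = len(foo.readlines())
This opens the file, creates a list of the lines in the file, counts the length of the list, saves it to a variable, and then closes the file again. Note that readlines holds every line in memory at once, so memory use grows with the size of the file.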
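If holding the whole list in memory is a concern, the same count can be obtained by iterating over the file object, which yields one line at a time. This is a minimal sketch rather than a drop-in replacement from any answer; the name count_lines_streaming is hypothetical, and binary mode is an assumption chosen so that only b'\n' ends a line and no decoding is done.

```python
def count_lines_streaming(path):
    # Iterating a file object yields one line at a time, so only the current
    # line is held in memory (plus the file object's internal read buffer).
    with open(path, 'rb') as f:
        return sum(1 for _ in f)
```

This returns 0 for an empty file and counts a final line that has no trailing newline. A single extremely long line would still be read into memory in full.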