如何以最有效的内存和时间方式获取大文件的行数?
def file_len(filename):
with open(filename) as f:
for i, _ in enumerate(f):
pass
return i + 1
如何以最有效的内存和时间方式获取大文件的行数?
def file_len(filename):
with open(filename) as f:
for i, _ in enumerate(f):
pass
return i + 1
当前回答
下面是一个python程序,使用多处理库将行计数分布到不同的机器/核。使用8核windows 64服务器,我的测试将一个2000万行文件的计数从26秒提高到7秒。注意:不使用内存映射会使运行速度变慢。
import multiprocessing, sys, time, os, mmap
import logging, logging.handlers
def init_logger(pid):
console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
logger = logging.getLogger() # New logger at root level
logger.setLevel( logging.INFO )
logger.handlers.append( logging.StreamHandler() )
logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )
def getFileLineCount( queues, pid, processes, file1 ):
init_logger(pid)
logging.info( 'start' )
physical_file = open(file1, "r")
# mmap.mmap(fileno, length[, tagname[, access[, offset]]]
m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )
#work out file size to divide up line counting
fSize = os.stat(file1).st_size
chunk = (fSize / processes) + 1
lines = 0
#get where I start and stop
_seedStart = chunk * (pid)
_seekEnd = chunk * (pid+1)
seekStart = int(_seedStart)
seekEnd = int(_seekEnd)
if seekEnd < int(_seekEnd + 1):
seekEnd += 1
if _seedStart < int(seekStart + 1):
seekStart += 1
if seekEnd > fSize:
seekEnd = fSize
#find where to start
if pid > 0:
m1.seek( seekStart )
#read next line
l1 = m1.readline() # need to use readline with memory mapped files
seekStart = m1.tell()
#tell previous rank my seek start to make their seek end
if pid > 0:
queues[pid-1].put( seekStart )
if pid < processes-1:
seekEnd = queues[pid].get()
m1.seek( seekStart )
l1 = m1.readline()
while len(l1) > 0:
lines += 1
l1 = m1.readline()
if m1.tell() > seekEnd or len(l1) == 0:
break
logging.info( 'done' )
# add up the results
if pid == 0:
for p in range(1,processes):
lines += queues[0].get()
queues[0].put(lines) # the total lines counted
else:
queues[0].put(lines)
m1.close()
physical_file.close()
if __name__ == '__main__':
init_logger( 'main' )
if len(sys.argv) > 1:
file_name = sys.argv[1]
else:
logging.fatal( 'parameters required: file-name [processes]' )
exit()
t = time.time()
processes = multiprocessing.cpu_count()
if len(sys.argv) > 2:
processes = int(sys.argv[2])
queues=[] # a queue for each process
for pid in range(processes):
queues.append( multiprocessing.Queue() )
jobs=[]
prev_pipe = 0
for pid in range(processes):
p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
p.start()
jobs.append(p)
jobs[0].join() #wait for counting to finish
lines = queues[0].get()
logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )
其他回答
这是对其他一些答案的元评论。
The line-reading and buffered \n-counting techniques won't return the same answer for every file, because some text files have no newline at the end of the last line. You can work around this by checking the last byte of the last nonempty buffer and adding 1 if it's not b'\n'. In Python 3, opening the file in text mode and in binary mode can yield different results, because text mode by default recognizes CR, LF, and CRLF as line endings (converting them all to '\n'), while in binary mode only LF and CRLF will be counted if you count b'\n'. This applies whether you read by lines or into a fixed-size buffer. The classic Mac OS used CR as a line ending; I don't know how common those files are these days. The buffer-reading approach uses a bounded amount of RAM independent of file size, while the line-reading approach could read the entire file into RAM at once in the worst case (especially if the file uses CR line endings). In the worst case it may use substantially more RAM than the file size, because of overhead from dynamic resizing of the line buffer and (if you opened in text mode) Unicode decoding and storage. You can improve the memory usage, and probably the speed, of the buffered approach by pre-allocating a bytearray and using readinto instead of read. One of the existing answers (with few votes) does this, but it's buggy (it double-counts some bytes). The top buffer-reading answer uses a large buffer (1 MiB). Using a smaller buffer can actually be faster because of OS readahead. If you read 32K or 64K at a time, the OS will probably start reading the next 32K/64K into the cache before you ask for it, and each trip to the kernel will return almost immediately. If you read 1 MiB at a time, the OS is unlikely to speculatively read a whole megabyte. It may preread a smaller amount but you will still spend a significant amount of time sitting in the kernel waiting for the disk to return the rest of the data.
如果你的文件中的所有行都是相同的长度(并且只包含ASCII字符)*,你可以非常便宜地执行以下操作:
fileSize = os.path.getsize( pathToFile ) # file size in bytes
bytesPerLine = someInteger # don't forget to account for the newline character
numLines = fileSize // bytesPerLine
*如果使用像é这样的unicode字符,我怀疑需要更多的努力来确定一行中的字节数。
打开一个文件的结果是一个迭代器,它可以转换为一个序列,它有一个长度:
with open(filename) as f:
return len(list(f))
这比显式循环更简洁,并避免了枚举。
对我来说,这个变体是最快的:
#!/usr/bin/env python
def main():
f = open('filename')
lines = 0
buf_size = 1024 * 1024
read_f = f.read # loop optimization
buf = read_f(buf_size)
while buf:
lines += buf.count('\n')
buf = read_f(buf_size)
print lines
if __name__ == '__main__':
main()
原因:缓冲比逐行和逐字符串读取快。计数也非常快
这是我用纯python发现的最快的东西。 你可以通过设置buffer来使用任意大小的内存,不过在我的电脑上2**16似乎是一个最佳位置。
from functools import partial
buffer=2**16
with open(myfile) as f:
print sum(x.count('\n') for x in iter(partial(f.read,buffer), ''))
我在这里找到了答案为什么在c++中从stdin读取行要比Python慢得多?稍微调整了一下。这是一个非常好的阅读来理解如何快速计数行,尽管wc -l仍然比其他任何方法快75%。