I want to read a large file (>5GB) line by line without loading its entire contents into memory. I can't use readlines(), since it builds a very large list in memory.
Current answer
It is best to use an iterator. Related: fileinput, which iterates over lines from multiple input streams.
From the docs:
import fileinput
for line in fileinput.input("filename", encoding="utf-8"):
    process(line)
This avoids copying the whole file into memory at once.
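Since fileinput can also chain several input streams, a minimal sketch of that use (the filenames and the process function here are placeholders, and the encoding argument assumes Python 3.10+) could look like this:
import fileinput

def process(line):
    # Placeholder: replace with whatever per-line work is needed
    print(len(line))

# Several files are chained and still read lazily, one line at a time;
# with no files argument, fileinput falls back to sys.argv[1:] or stdin.
with fileinput.input(files=("part1.log", "part2.log"), encoding="utf-8") as f:
    for line in f:
        process(line)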
Other answers
All you need to do is use the file object as an iterator.
for line in open("log.txt"):
    do_something_with(line)
In recent Python versions it is better to use a context manager:
with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)
This will also close the file automatically.
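As a concrete sketch of why this stays memory-friendly, the loop below (the filename is just an example) counts lines and bytes of a large log while only ever holding one line at a time:
line_count = 0
byte_count = 0
# Iterating over the file object yields one line per step,
# so memory use stays flat regardless of the file size.
with open("log.txt", encoding="utf-8") as fileobject:
    for line in fileobject:
        line_count += 1
        byte_count += len(line.encode("utf-8"))
print(line_count, byte_count)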
Use a for loop on the file object to read it line by line. Use with open(...) so the context manager ensures the file is closed after reading:
with open("log.txt") as infile:
    for line in infile:
        print(line)
I couldn't believe it could be as easy as @john-la-rooy's answer made it seem, so I recreated the cp command using line-by-line reading and writing. It's crazy fast.
#!/usr/bin/env python3.6
import sys
with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)
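If the script is saved as, say, copyline.py (the name is arbitrary), it is run as python3 copyline.py source.txt destination.txt, with the source file as the first argument and the destination as the second.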
I realize this question was answered long ago, but here is a way to do it in parallel without killing your memory overhead (which would be the case if you tried to fire each line into the pool individually). Obviously, swap the readJSON_line2 function out for something sensible - it is only there to illustrate the point!
The speedup will depend on the file size and what you do with each line, but for the worst case of a small file simply read with the JSON reader, I see performance with the settings below similar to single-threaded.
Hopefully somebody finds it useful:
def readJSON_line2(linesIn):
    # Function for reading a chunk of json lines
    '''
    Note, this function is nonsensical. A user would never use the approach suggested
    for reading in a JSON file,
    its role is to evaluate the MT approach for full line by line processing to both
    increase speed and reduce memory overhead
    '''
    import json
    linesRtn = []
    for lineIn in linesIn:
        if lineIn.strip():  # Non-blank lines are parsed as JSON
            lineRtn = json.loads(lineIn)
        else:
            lineRtn = ""
        linesRtn.append(lineRtn)
    return linesRtn
# -------------------------------------------------------------------
if __name__ == "__main__":
    import multiprocessing as mp

    path1 = "C:\\user\\Documents\\"
    file1 = "someBigJson.json"

    nCPUs = mp.cpu_count()  # Number of worker processes
    pool = mp.Pool(nCPUs)   # SMP pool that the chunks are submitted to
    nBuffer = 20 * nCPUs    # How many chunks are queued up (so cpus aren't waiting on processes spawning)
    nChunk = 1000           # How many lines are in each chunk
    # Both of the above will require balancing speed against memory overhead

    iJob = 0      # Tracker for SMP jobs submitted into pool
    iiJob = 0     # Tracker for SMP jobs extracted back out of pool
    jobs = []     # SMP job holder
    MTres3 = []   # Final result holder
    chunk = []
    iBuffer = 0   # Buffer line count
    with open(path1 + file1) as f:
        for line in f:
            # Send to the chunk
            if len(chunk) < nChunk:
                chunk.append(line)
            else:
                # Chunk full
                # Don't forget to add the current line to chunk
                chunk.append(line)
                # Then add the chunk to the buffer (submit to SMP pool)
                jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
                iJob += 1
                iBuffer += 1
                # Clear the chunk for the next batch of entries
                chunk = []
            # Buffer is full, any more chunks submitted would cause undue memory overhead
            # (Partially) empty the buffer
            if iBuffer >= nBuffer:
                temp1 = jobs[iiJob].get()
                for rtnLine1 in temp1:
                    MTres3.append(rtnLine1)
                iBuffer -= 1
                iiJob += 1
    # Submit the last chunk if it exists (as it would not have been submitted to SMP buffer)
    if chunk:
        jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
        iJob += 1
        iBuffer += 1
    # And gather up the last of the buffer, including the final chunk
    while iiJob < iJob:
        temp1 = jobs[iiJob].get()
        for rtnLine1 in temp1:
            MTres3.append(rtnLine1)
        iiJob += 1
    # Cleanup
    del chunk, jobs, temp1
    pool.close()
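The same chunk-and-pool idea can be written more compactly with Pool.imap over a generator of chunks. This is only a sketch, not the code above: process_chunk and the chunk size are placeholders, and imap may read ahead and queue pending chunks, so the manual buffering above gives tighter control over memory:
import multiprocessing as mp
from itertools import islice

def process_chunk(lines):
    # Placeholder per-chunk work; swap in whatever parsing is actually needed
    return [len(line) for line in lines]

def chunked(fileobject, n):
    # Lazily yield lists of n lines so the whole file is never held in memory
    while True:
        chunk = list(islice(fileobject, n))
        if not chunk:
            return
        yield chunk

if __name__ == "__main__":
    with open("someBigJson.json") as f, mp.Pool() as pool:
        # imap preserves order and hands chunks to workers as they become free
        for result in pool.imap(process_chunk, chunked(f, 1000)):
            pass  # consume or aggregate the per-chunk results here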
Old-school approach:
fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()
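On Python 3.8+, the same readline() loop can be written a bit more compactly with the walrus operator and a context manager; this is just a sketch of the equivalent pattern:
with open(file_name, 'rt') as fh:
    # := assigns and tests in one step; readline() returns '' at EOF, which ends the loop
    while line := fh.readline():
        pass  # do stuff with line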