How can I read large text files line by line without loading them into memory?

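I want to read a large file (>5GB) line by line without loading its entire contents into memory. I cannot use readlines() because it creates a very large list in memory.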

Current answer

The old-school approach:

fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()
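
For comparison, the same loop is commonly written by iterating over the file object inside a with block, which closes the file even if processing fails. This is a minimal sketch rather than part of the answer above; count_lines is a hypothetical helper name used only for illustration.

```python
def count_lines(path):
    """Count lines by iterating over the file object, one line at a time."""
    count = 0
    # The with block closes the file even if the loop body raises.
    with open(path, 'rt') as fh:
        for line in fh:  # the file object yields one line per iteration
            count += 1
    return count
```

Replacing the counter with any per-line processing keeps memory use bounded by the longest line plus the read buffer.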

2011-06-25 02:31:27

Other answers

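The blaze project has come a long way over the last 6 years. It has a simple API covering a useful subset of pandas features.

dask.dataframe takes care of chunking internally, supports many parallelisable operations, and lets you easily export slices back to pandas for in-memory operations.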

import dask.dataframe as dd

df = dd.read_csv('filename.csv')
df.head(10)  # return first 10 rows
df.tail(10)  # return last 10 rows

# iterate rows
for idx, row in df.iterrows():
    ...

# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()

# filter rows where my_field equals 'XYZ'
df[df.my_field=='XYZ'].compute()

2018-01-22 20:51:11

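I couldn't believe it could be as easy as @john-la-rooy's answer made it seem. So I recreated the cp command using line-by-line reads and writes. It is crazy fast.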

#!/usr/bin/env python3.6

import sys

with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)

2017-08-10 21:48:08

Here is code for loading text files of any size without causing memory issues. It supports gigabyte-sized files.

https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d

Download the file data_loading_utils.py and import it into your code.

Usage

import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000


def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data; data is one single line of the file
        pass
    else:
        # end of file reached
        pass


data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)

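process_lines is the callback function. It is called for every line, with the parameter data representing one single line of the file each time.

You can tune the variable CHUNK_SIZE to suit your machine's hardware configuration.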
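
The gist's code is not reproduced here, so the following is only a sketch of how a chunked, callback-driven reader of this kind might work. It is an assumption, not the gist's actual implementation: read_lines_with_callback is a hypothetical name, and the only thing taken from the answer is the callback signature (data, eof, file_name).

```python
def read_lines_with_callback(file_name, callback, chunk_size=1000000):
    """Hypothetical stand-in for the gist's reader; not its actual code."""
    with open(file_name, 'r') as fh:
        leftover = ''
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split('\n')
            # The last piece may be an incomplete line; keep it for the next chunk.
            leftover = lines.pop()
            for line in lines:
                callback(line, False, file_name)
        # A final line without a trailing newline is still delivered.
        if leftover:
            callback(leftover, False, file_name)
    # Signal end of file once, with no data.
    callback(None, True, file_name)
```

Only one chunk plus one partial line is held in memory at a time, which is why chunk_size can be adjusted to the available RAM.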

2018-07-25 02:32:16

This is the best solution I found; I tried it on a 330 MB file.

lineno = 500
line_length = 8
with open('catfour.txt', 'r') as file:
    file.seek(lineno * (line_length + 2))
    print(file.readline(), end='')

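Here line_length is the number of characters in a single line, which must be the same for every line for the offset to be correct. For example, "abcd" has a line length of 4.

I added 2 to the line length to skip the '\n' character and move to the next character.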
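
The offset arithmetic can be made explicit with a small sketch. It assumes every line has exactly the same length and a single-byte encoding, and it opens the file in binary mode so that seek positions are plain byte offsets. read_fixed_width_line and newline_length are hypothetical names; newline_length would be 1 for '\n' line endings and 2 for '\r\n'.

```python
def read_fixed_width_line(path, lineno, line_length, newline_length=1):
    """Return line `lineno` (0-based) from a file whose lines all have equal length."""
    # Each record occupies the line's characters plus its line terminator.
    record_size = line_length + newline_length
    with open(path, 'rb') as fh:
        fh.seek(lineno * record_size)  # jump straight to the start of the line
        return fh.read(line_length).decode('ascii')
```

Because the position is computed rather than scanned for, the cost does not grow with the line number, but the result is wrong as soon as a single line differs in length.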

2020-05-02 12:46:16

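This can be useful when you want to work in parallel and read only chunks of data, while keeping each chunk cleanly terminated at a newline.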

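def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        # extend the chunk to the next newline so that no line is split
        while data[-1:] != '\n':
            char = fileObj.read(1)
            if not char:  # end of file without a trailing newline
                break
            data += char
        yield data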
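
A similar batching effect can be sketched with the standard readlines(hint) method, which stops reading once the total size of the lines read so far reaches the hint, so every batch contains only whole lines. This is a different technique from the generator above, shown for comparison; read_line_batches and size_hint are hypothetical names.

```python
def read_line_batches(file_obj, size_hint=1024):
    """Yield lists of complete lines, each list roughly size_hint in total size."""
    while True:
        # readlines(hint) returns whole lines only and an empty list at end of file.
        batch = file_obj.readlines(size_hint)
        if not batch:
            break
        yield batch
```

Each yielded list could then be handed to a separate worker without any line being split across batches.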

2019-05-10 12:00:04
