I have a very large 4GB file, and when I try to read it my computer hangs. So I want to read it piece by piece, and after processing each piece, store the processed piece into another file and read the next piece.
Is there any method to yield these pieces?
I would love to have a lazy method.
Current answer
I was in a similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) that is required is known:
def get_line():
    with open('4gb_file') as file:
        for i in file:
            yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]
Update: Thanks nosklo. Here's what I meant. It almost works, except that it loses a line "between" chunks.
chunk = [next(gen) for i in range(lines_required)]
does the trick without losing any lines, but it doesn't look very nice.
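For what it's worth, itertools.islice expresses the same "take lines_required lines at a time" idea a bit more cleanly without losing lines. A minimal sketch (Python 3.8+) reusing get_line() and lines_required from above; handle_chunk() is a hypothetical stand-in for your own processing:
from itertools import islice

gen = get_line()
while chunk := list(islice(gen, lines_required)):  # an empty list ends the loop
    handle_chunk(chunk)  # hypothetical: process one batch of lines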
Other answers
f = ...  # file-like object, i.e. supporting a read(size) method and
         # returning the empty string '' when there is nothing to read

def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), '')

for data in chunked(f, 65536):
    process(data)  # process the data
Update: the approach is best explained at https://stackoverflow.com/a/4566523/38592
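Tying this back to the question, a minimal sketch that writes each processed chunk to another file; process() is a hypothetical stand-in for your own transformation, and the '' sentinel in chunked() assumes a text-mode file (use b'' for binary mode):
with open('big_file') as f_in, open('processed_file', 'w') as f_out:
    for data in chunked(f_in, 65536):
        f_out.write(process(data))  # process() is assumed, not part of the answer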
If your computer, OS and Python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation:
import mmap

with open("hello.txt", "r+b") as f:  # mmap needs a binary-mode file
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()
If your computer, OS or Python are 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
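To apply mmap to the chunked-processing use case from the question, a minimal sketch, assuming a 64-bit build; 'big_file', 'processed_file' and the pass-through write are placeholders for your own names and processing:
import mmap

CHUNK = 1024 * 1024  # 1 MiB per slice

with open('big_file', 'rb') as f, open('processed_file', 'wb') as out:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for offset in range(0, len(mm), CHUNK):
        block = mm[offset:offset + CHUNK]  # slicing copies this range into bytes
        out.write(block)  # replace with your real processing
    mm.close()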
Refer to Python's official documentation: https://docs.python.org/3/library/functions.html#iter
Maybe this method is more Pythonic:
"""A file object returned by open() is a iterator with
read method which could specify current read's block size
"""
with open('mydata.db', 'r') as f_in:
block_read = partial(f_in.read, 1024 * 1024)
block_iterator = iter(block_read, '')
for index, block in enumerate(block_iterator, start=1):
block = process_block(block) # process your block data
with open(f'{index}.txt', 'w') as f_out:
f_out.write(block)
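One caveat worth noting: the '' sentinel matches text-mode reads; a file opened in binary mode returns b'' at EOF, so the sentinel must change with it. A minimal sketch under that assumption (handle() is a hypothetical per-block callback):
from functools import partial

with open('mydata.db', 'rb') as f_in:
    for block in iter(partial(f_in.read, 1024 * 1024), b''):  # b'' signals EOF
        handle(block)  # hypothetical: process one binary block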
There are already many good answers, but if your entire file is on a single line and you still want to process "rows" (as opposed to fixed-size blocks), those answers won't help you.
99% of the time, it is possible to process files line by line. Then, as suggested in the answers, you can use the file object itself as a lazy generator:
with open('big.csv') as f:
    for line in f:
        process(line)
However, one may run into very large files where the row separator is not '\n' (a common case is '|').
Converting '|' to '\n' before processing may not be an option because it can mess up fields which may legitimately contain '\n' (e.g. free-text user input). Using the csv library is also ruled out because, at least in early versions of the lib, it is hardcoded to read the input line by line.
For these kinds of situations, I created the following snippet [updated in May 2021 for Python 3.8+]:
def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    row = ''
    while (chunk := f.read(chunksize)) != '':  # '' from read() means end of file
        while (i := chunk.find(sep)) != -1:    # -1 means no separator found
            yield row + chunk[:i]
            chunk = chunk[i+1:]
            row = ''
        row += chunk
    yield row
[For older versions of Python]:
def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    curr_row = ''
    while True:
        chunk = f.read(chunksize)
        if chunk == '':  # End of file
            yield curr_row
            break
        while True:
            i = chunk.find(sep)
            if i == -1:
                break
            yield curr_row + chunk[:i]
            curr_row = ''
            chunk = chunk[i+1:]
        curr_row += chunk
I was able to use it successfully to solve various problems. It has been extensively tested, with various chunk sizes. Here is the test suite I am using, for those who need to convince themselves:
import os

test_file = 'test_file'


def cleanup(func):
    def wrapper(*args, **kwargs):
        func(*args, **kwargs)
        os.unlink(test_file)
    return wrapper


@cleanup
def test_empty(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1


@cleanup
def test_1_char_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2


@cleanup
def test_1_char(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1


@cleanup
def test_1025_chars_1_row(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1


@cleanup
def test_1024_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1023):
            f.write('a')
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2


@cleanup
def test_1025_chars_1026_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1026


@cleanup
def test_2048_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk --
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2


@cleanup
def test_2049_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk --
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2


if __name__ == '__main__':
    for chunksize in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
        test_empty(chunksize)
        test_1_char_2_rows(chunksize)
        test_1_char(chunksize)
        test_1025_chars_1_row(chunksize)
        test_1024_chars_2_rows(chunksize)
        test_1025_chars_1026_rows(chunksize)
        test_2048_chars_2_rows(chunksize)
        test_2049_chars_2_rows(chunksize)
You can use the following code.
import os

file_obj = open('big_file')
open() returns a file object.
Then use os.stat to get the size of the data:
file_size = os.stat('big_file').st_size

for i in range((file_size + 1023) // 1024):  # round up so a trailing partial chunk is read too
    print(file_obj.read(1024))
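For what it's worth, a minimal alternative sketch (Python 3.8+) that avoids the size arithmetic altogether by reading until read() returns an empty string:
with open('big_file') as file_obj:
    while chunk := file_obj.read(1024):  # '' at EOF ends the loop
        print(chunk)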