I am trying to read a large csv file (approx. 6 GB) in pandas and I am getting a memory error:
MemoryError Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')
...
MemoryError:
Any help on this?
Current answer
Solution 1:
Using pandas with large data
Solution 2:
TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList, sort=False)
Other answers
I went about it like this:
chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'],
                       index_col='slno', header=None, parse_dates=['date'])
df = pd.DataFrame()
%time df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
Chunking shouldn't always be the first port of call for this problem.

1. Is the file large due to repeated non-numeric data or unwanted columns? If so, you can sometimes see massive memory savings by reading columns in as categories and selecting only the required columns via the pd.read_csv usecols parameter (see the sketch after this list).
2. Does your workflow require slicing, manipulating, and exporting? If so, you can use dask.dataframe to slice, perform your calculations, and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.
3. If all else fails, read line by line via chunks, using pandas or the csv library as a last resort.
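A minimal sketch of points 1 and 2, assuming the same ';'-separated file as the question; the column names 'col_a' and 'col_b' and the groupby are placeholders to adapt to your data:

import pandas as pd

# Point 1: read only the columns you need and store repeated string values
# as categories -- both can cut memory use dramatically.
df = pd.read_csv('aphro.csv', sep=';',
                 usecols=['col_a', 'col_b'],
                 dtype={'col_a': 'category'})
print(df.memory_usage(deep=True))

# Point 2: let dask do the chunking for you and run pandas-style operations
# lazily, materialising only the final (small) result.
import dask.dataframe as dd
ddf = dd.read_csv('aphro.csv', sep=';')
result = ddf.groupby('col_a')['col_b'].sum().compute()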
If anyone is still looking for something like this, I found that the new library called modin can help. It uses distributed computing to speed up the read. Here is a nice article comparing its functionality with pandas. It essentially uses the same functions as pandas.
import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)
Here is an example:
chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

    # REPLACING BLANK SPACES AT COLUMNS' NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns={c: c.replace(' ', '') for c in chunk.columns})

    # YOU CAN EITHER:
    # 1) BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET
    chunkTemp.append(chunk)

    # 2) DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]
    # BUFFERING PROCESSED DATA
    queryTemp.append(query)

#! NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

# CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)
The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at once, one way to avoid the problem is to process the CSV in chunks (by specifying the chunksize parameter):
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows, of course.)
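What process does is up to you. As a minimal sketch, assuming you only need the rows where a hypothetical 'rf' column is positive, each chunk can be filtered and only the matches kept:

import pandas as pd

# Hypothetical per-chunk step: keep only the rows of interest, so the full
# file never has to sit in memory at once. The 'rf' column and the filter
# condition are placeholders for your own logic.
def process(chunk):
    return chunk[chunk['rf'] > 0]

chunksize = 10 ** 6
filtered = pd.concat(process(chunk)
                     for chunk in pd.read_csv('aphro.csv', sep=';', chunksize=chunksize))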
pandas >= 1.2
read_csv with chunksize returns a context manager, to be used like so:
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)
See GH38225