我试图读取一个大的csv文件(aprox。6 GB)在熊猫和我得到一个内存错误:
MemoryError Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')
...
MemoryError:
有什么帮助吗?
我试图读取一个大的csv文件(aprox。6 GB)在熊猫和我得到一个内存错误:
MemoryError Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')
...
MemoryError:
有什么帮助吗?
当前回答
除了上面的答案,对于那些想要处理CSV然后导出到CSV、parquet或SQL的人来说,d6tstack是另一个不错的选择。您可以加载多个文件,它处理数据模式更改(添加/删除列)。核心支持已经被剔除。
def apply(dfg):
# do stuff
return dfg
c = d6tstack.combine_csv.CombinerCSV([bigfile.csv], apply_after_read=apply, sep=',', chunksize=1e6)
# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)
# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # slow but flexible
其他回答
如果你有一个csv文件,有数百万个数据条目,你想要加载完整的数据集,你应该使用dask_cudf,
import dask_cudf as dc
df = dc.read_csv("large_data.csv")
您可以将数据读入为块,并将每个块保存为pickle。
import pandas as pd
import pickle
in_path = "" #Path where the large file is
out_path = "" #Path to save the pickle files to
chunk_size = 400000 #size of chunks relies on your available memory
separator = "~"
reader = pd.read_csv(in_path,sep=separator,chunksize=chunk_size,
low_memory=False)
for i, chunk in enumerate(reader):
out_file = out_path + "/data_{}.pkl".format(i+1)
with open(out_file, "wb") as f:
pickle.dump(chunk,f,pickle.HIGHEST_PROTOCOL)
在下一步中,读入pickle并将每个pickle附加到所需的数据框架中。
import glob
pickle_path = "" #Same Path as out_path i.e. where the pickle files are
data_p_files=[]
for name in glob.glob(pickle_path + "/data_*.pkl"):
data_p_files.append(name)
df = pd.DataFrame([])
for i in range(len(data_p_files)):
df = df.append(pd.read_pickle(data_p_files[i]),ignore_index=True)
除了上面的答案,对于那些想要处理CSV然后导出到CSV、parquet或SQL的人来说,d6tstack是另一个不错的选择。您可以加载多个文件,它处理数据模式更改(添加/删除列)。核心支持已经被剔除。
def apply(dfg):
# do stuff
return dfg
c = d6tstack.combine_csv.CombinerCSV([bigfile.csv], apply_after_read=apply, sep=',', chunksize=1e6)
# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)
# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # slow but flexible
该错误表明机器没有足够的内存来读取整个 CSV一次转换成一个数据帧。假设您不需要整个数据集 内存,避免这个问题的一种方法是处理CSV在 Chunks(通过指定chunksize参数):
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
process(chunk)
chunksize参数指定每个块的行数。 (当然,最后一个块可能包含少于块大小的行。)
熊猫>= 1.2
Read_csv with chunksize返回一个上下文管理器,像这样使用:
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
for chunk in reader:
process(chunk)
参见 GH38225
分块不应该总是解决这个问题的第一步。
Is the file large due to repeated non-numeric data or unwanted columns? If so, you can sometimes see massive memory savings by reading in columns as categories and selecting required columns via pd.read_csv usecols parameter. Does your workflow require slicing, manipulating, exporting? If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of pandas API. If all else fails, read line by line via chunks. Chunk via pandas or via csv library as a last resort.