Pandas read_csv: low_memory and dtype options

df = pd.read_csv('somefile.csv')

...gives an error:

…/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why would low_memory=False help?

Current answer

Sometimes, when all else fails, you just want to tell pandas to shut up about it:

import warnings

# Ignore DtypeWarnings from pandas' read_csv
warnings.filterwarnings('ignore', message="^Columns.*")
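
The filter above matches on the message text and stays active for the rest of the process. As a narrower variant, here is a minimal sketch that filters by warning category and only around one read_csv call. The sample CSV text is made up for illustration and is not from the question.

```python
import warnings
from io import StringIO

import pandas as pd

# Hypothetical sample data, only here to make the sketch runnable.
csv_text = "a,b\n1,x\n2,y\n"

# The filter applies only inside this block; the previous warning
# settings are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=pd.errors.DtypeWarning)
    df = pd.read_csv(StringIO(csv_text))

print(df.shape)
```

Note that this only hides the warning. Any mixed-type columns are still there afterwards.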

2020-11-25 23:59:45

Other answers

Building on the answer given by Jerald Achaibar, we can detect the mixed-dtypes warning and use the slower python engine only when the warning occurs:

import warnings

import pandas

# Turn the mixed-dtype warning into a Python error so we can catch it and retry
# the load using the slower python engine.
# path, sep and encoding are assumed to be defined by the caller.
warnings.simplefilter('error', pandas.errors.DtypeWarning)
try:
    df = pandas.read_csv(path, sep=sep, encoding=encoding)
except pandas.errors.DtypeWarning:
    df = pandas.read_csv(path, sep=sep, encoding=encoding, engine="python")
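
One caveat with the snippet above is that simplefilter changes the warning filters globally. A possible way around that, sketched here under the assumption that a small wrapper is acceptable, is to scope the filter with warnings.catch_warnings. The helper name read_csv_with_fallback and the sample data are hypothetical.

```python
import warnings
from io import StringIO

import pandas as pd


def read_csv_with_fallback(source, **kwargs):
    """Hypothetical helper: try the default C engine first and retry with
    the python engine only if a DtypeWarning is raised."""
    with warnings.catch_warnings():
        # Escalate DtypeWarning to an exception, but only inside this block.
        warnings.simplefilter("error", pd.errors.DtypeWarning)
        try:
            return pd.read_csv(source, **kwargs)
        except pd.errors.DtypeWarning:
            pass
    # File-like objects have been partly consumed, so rewind before retrying.
    if hasattr(source, "seek"):
        source.seek(0)
    return pd.read_csv(source, engine="python", **kwargs)


# Made-up data for illustration; this small input takes the fast path.
df = read_csv_with_fallback(StringIO("user_id,username\n1,Alice\n2,Bob\n"))
print(df)
```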

2022-08-26 10:29:21

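According to the pandas documentation, specifying low_memory=False is a reasonable solution to this problem, as long as engine='c' (which is the default).

If low_memory=False, then whole columns are read in first, and only then are the proper types determined. For example, a column will be kept as objects (strings) where that is needed to preserve information.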

If low_memory=True (the default), then pandas reads in the data in chunks of rows, then appends them together. Then some of the columns might look like chunks of integers and strings mixed up, depending on whether during the chunk pandas encountered anything that couldn't be cast to integer (say). This could cause problems later. The warning is telling you that this happened at least once in the read in, so you should be careful. Setting low_memory=False will use more memory but will avoid the problem.
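
As a small sketch of the low_memory=False behaviour described above (the sample data is made up, and an input this small would not be chunked anyway, so it only shows what the resulting column looks like):

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: one non-numeric value in an otherwise numeric column.
csv_text = "user_id,username\n1,Alice\n2,Bob\nfoobar,Cthulhu\n"

# low_memory=False lets the C parser decide each column's type from the whole column.
df = pd.read_csv(StringIO(csv_text), low_memory=False)

# Collect the Python types actually stored in the user_id column.
value_types = {type(v) for v in df["user_id"]}
print(value_types)
```

Because "foobar" cannot be parsed as a number, every value in user_id is kept as a string, including "1" and "2", so the column is at least consistent.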

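Personally, I think low_memory=True is a bad default, but I work in an area that uses many more small datasets than large ones, so convenience matters more to me than efficiency.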

The following code illustrates an example where low_memory=True is set and a column comes in with mixed types. It builds on the answer by @firelynx.

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

# make a big csv data file, following earlier approach by @firelynx
csvdata = """1,Alice
2,Bob
3,Caesar
"""

# we have to replicate the "integer column" user_id many many times to get
# pd.read_csv to actually chunk read. otherwise it just reads 
# the whole thing in one chunk, because it's faster, and we don't get any 
# "mixed dtype" issue. the 100000 below was chosen by experimentation.
csvdatafull = ""
for i in range(100000):
    csvdatafull = csvdatafull + csvdata
csvdatafull =  csvdatafull + "foobar,Cthlulu\n"
csvdatafull = "user_id,username\n" + csvdatafull

sio = StringIO(csvdatafull)
# the following line gives me the warning:
# C:\Users\rdisa\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
#   interactivity=interactivity, compiler=compiler, result=result)
# but it does not always give me the warning, so I guess the internal workings of read_csv depend on background factors
x = pd.read_csv(sio, low_memory=True) #, dtype={"user_id": int, "username": "string"})

x.dtypes
# this gives:
# Out[69]: 
# user_id     object
# username    object
# dtype: object

type(x['user_id'].iloc[0]) # int
type(x['user_id'].iloc[1]) # int
type(x['user_id'].iloc[2]) # int
type(x['user_id'].iloc[10000]) # int
type(x['user_id'].iloc[299999]) # str !!!! (even though it's a number! so this chunk must have been read in as strings)
type(x['user_id'].iloc[300000]) # str !!!!!

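Aside: to give an example of where this is a problem (and where I first ran into it as a serious issue), imagine you ran pd.read_csv() on a file and then wanted to drop duplicates based on an identifier. Say the identifier is sometimes numeric and sometimes a string. One row might be "81287", another might be "97324-32". Still, they are unique identifiers.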

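With low_memory=True, pandas reads the identifier column in many chunks, so the identifier 81287 may come back as a number in some rows and as a string in others. When I then try to drop duplicates based on this column, the two forms do not compare equal: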

81287 == "81287"
Out[98]: False
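
To make the duplicate problem concrete, here is a minimal sketch. The DataFrame is built by hand to imitate a mixed-type column (it is not real read_csv output), and the column names are made up. Forcing the identifier to str, either after the fact or with dtype at read time, is one way to get consistent comparisons.

```python
from io import StringIO

import pandas as pd

# Hand-built stand-in for a column that came back with mixed types.
mixed = pd.DataFrame({"id": [81287, "81287", "97324-32", "97324-32"]})

# The int 81287 and the string "81287" count as different values.
n_mixed = len(mixed.drop_duplicates("id"))

# After converting everything to str, the two forms collapse into one.
n_clean = len(mixed.astype({"id": str}).drop_duplicates("id"))

# Declaring the type at read time avoids the mix in the first place.
csv_text = "id,value\n81287,a\n81287,b\n97324-32,c\n"
df = pd.read_csv(StringIO(csv_text), dtype={"id": str})
n_read = len(df.drop_duplicates("id"))

print(n_mixed, n_clean, n_read)
```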

2020-07-19 01:36:30

Try:

# error_bad_lines was replaced by on_bad_lines='skip' in pandas 1.3 and removed in 2.0
dashboard_df = pd.read_csv(p_file, sep=',', on_bad_lines='skip', index_col=False, dtype='unicode')

According to the pandas documentation:

dtype : Type name or dict of column -> type
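
A short sketch of the two forms that dtype accepts, a single type name for every column or a dict mapping column names to types. The column names and data here are invented for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data.
csv_text = "user_id,score\n1,0.5\n2,1.5\n"

# A single type name applies to every column.
df_all_str = pd.read_csv(StringIO(csv_text), dtype=str)

# A dict sets the type column by column.
df_per_col = pd.read_csv(StringIO(csv_text), dtype={"user_id": str, "score": float})

print(df_all_str.dtypes)
print(df_per_col.dtypes)
```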

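As for low_memory, it is True by default and is not yet documented. I don't think it is relevant, though. The error message is generic, so you shouldn't need to touch low_memory anyway. Hope this helps, and let me know if you have further problems.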

2014-06-16 20:11:56

It worked for me with low_memory=False when importing a DataFrame. That is the only change that worked for me:

df = pd.read_csv('export4_16.csv',low_memory=False)

2019-04-17 14:40:40

As the error says, you should specify the datatypes when using the read_csv() method. So, you should write

file = pd.read_csv('example.csv', dtype='unicode')

2020-08-15 16:01:11
