df = pd.read_csv('somefile.csv')

...gives an error:

.../site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why does low_memory=False help?


Current answer

As the error says, you should specify the data types when calling the read_csv() method. So you could write

file = pd.read_csv('example.csv', dtype='unicode')
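
If you know which columns are affected (the warning lists them by position), a less heavy-handed option is to pass a per-column dtype mapping instead of forcing every column to strings. A minimal sketch, with hypothetical column names:

import pandas as pd

# Hypothetical column names -- replace them with the columns flagged in your DtypeWarning.
# Columns not listed here keep pandas' normal type inference.
df = pd.read_csv('somefile.csv', dtype={'user_id': str, 'postal_code': str})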

Other answers

According to the pandas documentation, specifying low_memory=False is a reasonable solution to this problem, as long as engine='c' (which is the default).

If low_memory=False, then whole columns are read in first, and the proper types determined afterwards. For example, a column will be kept as objects (strings) if that is needed to preserve the information.

If low_memory=True (the default), then pandas reads in the data in chunks of rows, then appends them together. Then some of the columns might look like chunks of integers and strings mixed up, depending on whether during the chunk pandas encountered anything that couldn't be cast to integer (say). This could cause problems later. The warning is telling you that this happened at least once in the read in, so you should be careful. Setting low_memory=False will use more memory but will avoid the problem.
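Note that the warning reports the offending columns by position, not by name. Assuming df is the frame returned by read_csv in the question, you can look them up and check which Python types actually ended up in one of them:

# Map the positions from the warning ("Columns (4,5,7,16) have mixed types.") to column names.
print(df.columns[[4, 5, 7, 16]])

# Count the Python types present in one of the flagged columns.
print(df.iloc[:, 4].map(type).value_counts())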

Personally, I think low_memory=True is a bad default, but I work in an area that uses many more small datasets than large ones, so convenience matters more than efficiency.

The following code illustrates an example where low_memory=True is set and a column comes in with mixed types. It builds on the answer by @firelynx.

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

# make a big csv data file, following earlier approach by @firelynx
csvdata = """1,Alice
2,Bob
3,Caesar
"""

# we have to replicate the "integer column" user_id many many times to get
# pd.read_csv to actually chunk read. otherwise it just reads 
# the whole thing in one chunk, because it's faster, and we don't get any 
# "mixed dtype" issue. the 100000 below was chosen by experimentation.
csvdatafull = ""
for i in range(100000):
    csvdatafull = csvdatafull + csvdata
csvdatafull =  csvdatafull + "foobar,Cthlulu\n"
csvdatafull = "user_id,username\n" + csvdatafull

sio = StringIO(csvdatafull)
# the following line gives me the warning:
    # C:\Users\rdisa\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
    # interactivity=interactivity, compiler=compiler, result=result)
# but it does not always give me the warning, so i guess the internal workings of read_csv depend on background factors
x = pd.read_csv(sio, low_memory=True) #, dtype={"user_id": int, "username": "string"})

x.dtypes
# this gives:
# Out[69]: 
# user_id     object
# username    object
# dtype: object

type(x['user_id'].iloc[0]) # int
type(x['user_id'].iloc[1]) # int
type(x['user_id'].iloc[2]) # int
type(x['user_id'].iloc[10000]) # int
type(x['user_id'].iloc[299999]) # str !!!! (even though it's a number! so this chunk must have been read in as strings)
type(x['user_id'].iloc[300000]) # str !!!!!
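
For contrast, re-reading the same data with an explicit dtype gives a consistent type for the whole column (as would low_memory=False, per the explanation above); a minimal follow-up to the demo:

# read_csv consumed sio, so build a fresh buffer from the same data
sio2 = StringIO(csvdatafull)
y = pd.read_csv(sio2, dtype={"user_id": str, "username": str})

type(y['user_id'].iloc[0])       # str
type(y['user_id'].iloc[300000])  # str -- same type throughout the column now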

Aside: to give an example of where this becomes a problem (and where I first ran into it as a serious issue), imagine you ran pd.read_csv() on a file and then wanted to drop duplicates based on an identifier. Say the identifier is sometimes numeric and sometimes a string: one row might be "81287", another might be "97324-32". Still, they are unique identifiers.

With low_memory=True, pandas might read the identifier column in like this:

81287
81287
81287
81287
81287
"81287"
"81287"
"81287"
"81287"
"97324-32"
"97324-32"
"97324-32"
"97324-32"
"97324-32"

Because it reads things in chunks, sometimes the identifier 81287 is a number and sometimes a string. When I then try to drop duplicates based on it,

81287 == "81287"
Out[98]: False
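
A workaround I use for that dedup case: normalize the identifier column to one type before comparing. A sketch, assuming the column is called 'id':

# Force every identifier to a string so 81287 and "81287" compare equal,
# then deduplicate on the normalized column.
df['id'] = df['id'].astype(str)
df = df.drop_duplicates(subset='id')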

I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: check that your dataframe isn't bigger than your system memory, reboot, and free up RAM before proceeding. If you're still running into errors, make sure your .csv file is okay: take a quick look in Excel and confirm there is no obvious corruption. Broken original data can wreak havoc.

As firelynx mentioned earlier, if a dtype is explicitly specified and there is mixed data that is not compatible with that dtype, loading will crash. I used a converter like this as a workaround to change the values with incompatible data types, so the data could still be loaded.

import numpy as np
import pandas as pd

def conv(val):
    # empty/missing values become 0, unparseable values fall back to 0.0
    if not val:
        return 0
    try:
        return np.float64(val)
    except (ValueError, TypeError):
        return np.float64(0)

df = pd.read_csv(csv_file,converters={'COL_A':conv,'COL_B':conv})

Importing the dataframe with low_memory=False worked for me. That was the only change I needed:

df = pd.read_csv('export4_16.csv', low_memory=False)

df = pd.read_csv('somefile.csv', low_memory=False)

This should solve the issue. I got exactly the same error when reading 1.8M rows from a CSV.