I have data saved in a PostgreSQL database. I am querying this data using Python 2.7 and turning it into a Pandas DataFrame. However, the last column of this DataFrame has a dictionary of values inside it. The DataFrame df looks like this:

Station ID     Pollutants
8809           {"a": "46", "b": "3", "c": "12"}
8810           {"a": "36", "b": "5", "c": "8"}
8811           {"b": "2", "c": "7"}
8812           {"c": "11"}
8813           {"a": "82", "c": "15"}

I need to split this column into separate columns, so that the DataFrame df2 looks like this:

Station ID     a      b       c
8809           46     3       12
8810           36     5       8
8811           NaN    2       7
8812           NaN    NaN     11
8813           82     NaN     15

The main problem I ran into is that the dictionaries are not all the same length. But they all contain only up to the same 3 keys: 'a', 'b', and 'c', and these always appear in the same order ('a' first, 'b' second, 'c' third).

The following code used to work and return exactly what I wanted (df2):

objs = [df, pandas.DataFrame(df['Pollutants'].tolist()).iloc[:, :3]]
df2 = pandas.concat(objs, axis=1).drop('Pollutants', axis=1)
print(df2)

I ran this code just last week and it worked fine. But now my code is broken and I get this error from line [4]:

IndexError: out-of-bounds on slice (end) 

I did not change the code, yet now I get this error. I imagine this is because my method was not robust or proper to begin with.

Any suggestions or guidance on how to split this column of dictionaries into separate columns would be greatly appreciated!

EDIT: I think the .tolist() and .apply methods are not working on my data because the whole value is one Unicode string, i.e.:

#My data format 
u"{'a': '1', 'b': '2', 'c': '3'}"

#and not
{u'a': '1', u'b': '2', u'c': '3'}

This is the format the data arrives in when imported from the PostgreSQL database. Any help or ideas with this issue? Is there a way to convert the Unicode?
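For reference, a minimal sketch of converting one such string back into a dict (assuming it is a valid Python literal; the answers below apply this to the whole column):

from ast import literal_eval

s = u"{'a': '1', 'b': '2', 'c': '3'}"   # one stringified dict, as returned by the query
d = literal_eval(s)                     # now a real dict: {'a': '1', 'b': '2', 'c': '3'}
print(d['b'])                           # -> '2'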


Current Answer

Per the timing analysis performed by Shijith in this answer, the fastest method to normalize a column of flat, one-level dicts is:

df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))

It will not resolve other issues with columns of lists or dicts that are addressed below, such as rows with NaN or nested dicts.
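A minimal runnable sketch of that one-liner on this question's data (assuming the column already holds real dicts and the frame has a default integer index):

import pandas as pd

df = pd.DataFrame({
    'Station ID': [8809, 8810, 8811],
    'Pollutants': [{'a': '46', 'b': '3', 'c': '12'},
                   {'b': '2', 'c': '7'},
                   {'c': '11'}],
})

# pop removes the dict column; its values become rows of a new frame joined back on
df = df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
print(df)
#    Station ID    a    b   c
# 0        8809   46    3  12
# 1        8810  NaN    2   7
# 2        8811  NaN  NaN  11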

- pd.json_normalize(df.Pollutants) is significantly faster than df.Pollutants.apply(pd.Series). See the %%timeit below: for 1M rows, .json_normalize is far faster than .apply (46.9 ms vs. 7.75 s).
- Whether reading data from a file, or from an object returned by a database or an API, it may not be clear whether the dict column has dict or str type.
- If the dictionaries in the column are str type, they must be converted back to dict type, using ast.literal_eval or json.loads(…).
- Use pd.json_normalize to convert the dicts, with keys as headers and values as rows. There are additional parameters (e.g. record_path & meta) for dealing with nested dicts.
- Use pandas.DataFrame.join to combine the original DataFrame, df, with the columns created by pd.json_normalize.
- If the index isn't integers (as in the example), first use df.reset_index() to get an index of integers, before doing the normalize and join.
- pandas.DataFrame.pop is used to remove the specified column from the existing dataframe. This removes the need to drop the column later with pandas.DataFrame.drop.

Note that if the column has any NaN, they must be filled with an empty dict:

df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})

If the 'Pollutants' column is strings, use '{}' instead. Also see How to json_normalize a column with NaNs.

import pandas as pd
from ast import literal_eval
import numpy as np

data = {'Station ID': [8809, 8810, 8811, 8812, 8813, 8814],
        'Pollutants': ['{"a": "46", "b": "3", "c": "12"}', '{"a": "36", "b": "5", "c": "8"}', '{"b": "2", "c": "7"}', '{"c": "11"}', '{"a": "82", "c": "15"}', np.nan]}

df = pd.DataFrame(data)

# display(df)
   Station ID                        Pollutants
0        8809  {"a": "46", "b": "3", "c": "12"}
1        8810   {"a": "36", "b": "5", "c": "8"}
2        8811              {"b": "2", "c": "7"}
3        8812                       {"c": "11"}
4        8813            {"a": "82", "c": "15"}
5        8814                               NaN

# check the type of the first value in Pollutants
>>> print(type(df.iloc[0, 1]))
<class 'str'>

# replace NaN with '{}' if the column is strings, otherwise replace with {}
df.Pollutants = df.Pollutants.fillna('{}')  # if the NaN is in a column of strings
# df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})  # if the column is not strings

# Convert the column of stringified dicts to dicts
# skip this line, if the column contains dicts
df.Pollutants = df.Pollutants.apply(literal_eval)

# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True)  # uncomment if needed

# remove and normalize the column of dictionaries, and join the result to df
df = df.join(pd.json_normalize(df.pop('Pollutants')))

# display(df)
   Station ID    a    b    c
0        8809   46    3   12
1        8810   36    5    8
2        8811  NaN    2    7
3        8812  NaN  NaN   11
4        8813   82  NaN   15
5        8814  NaN  NaN  NaN

%%timeit

# dataframe with 1M rows
dfb = pd.concat([df]*20000).reset_index(drop=True)

%%timeit
dfb.join(pd.json_normalize(dfb.Pollutants))
[out]:
46.9 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
pd.concat([dfb.drop(columns=['Pollutants']), dfb.Pollutants.apply(pd.Series)], axis=1)
[out]:
7.75 s ± 52.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Other Answers

# expand the dict column with .apply(pd.Series);
# the 'a' and 'b' column names are from this answer's own example, not this question's data
df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)

I strongly recommend this method of extracting the 'Pollutants' column:

df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)

It's much faster than

df_pollutants = df['Pollutants'].apply(pd.Series)

when the size of df is large.
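A self-contained sketch of that pattern on this question's data (assuming the column holds real dicts; df_pollutants is this answer's own name):

import pandas as pd

df = pd.DataFrame({
    'Station ID': [8809, 8810],
    'Pollutants': [{'a': '46', 'b': '3', 'c': '12'}, {'b': '2', 'c': '7'}],
})

# one vectorized DataFrame construction instead of building a Series per row
df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)
df2 = df.drop(columns='Pollutants').join(df_pollutants)
print(df2)
#    Station ID    a  b   c
# 0        8809   46  3  12
# 1        8810  NaN  2   7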

Here is one solution:

>>> df = pd.concat([df['Station ID'], df['Pollutants'].apply(pd.Series)], axis=1)
>>> print(df)
   Station ID    a    b   c
0        8809   46    3  12
1        8810   36    5   8
2        8811  NaN    2   7
3        8812  NaN  NaN  11
4        8813   82  NaN  15

my_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['my_col'])

... will parse the dict properly (putting each dict key into a separate df column, and key values into df rows), so the dicts will not get squashed into a single column in the first place.
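A hedged sketch of that idea, assuming the query result can be fetched as a dict of dicts keyed by station (a hypothetical shape; without the columns argument, the inner keys become the columns):

import pandas as pd

# hypothetical raw result: one inner dict per station, keyed by Station ID
my_dict = {
    8809: {'a': '46', 'b': '3', 'c': '12'},
    8811: {'b': '2', 'c': '7'},
}

my_df = pd.DataFrame.from_dict(my_dict, orient='index')
print(my_df)
#         a    b   c
# 8809   46    3  12
# 8811  NaN    2   7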

How do I split a column of dictionaries into separate columns with pandas?

pd.DataFrame(df['val'].tolist()) is the canonical method for exploding a column of dictionaries.

Here's proof, with a colorful graph.

Benchmarking code, for reference.
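A minimal sketch of how such a benchmark could be set up with perfplot (assuming perfplot is installed; the kernel list is abbreviated to the main contenders):

import pandas as pd
import perfplot

def make_df(n):
    # n rows of small flat dicts in a column named 'val'
    return pd.DataFrame({'val': [{'a': 1, 'b': 2, 'c': 3}] * n})

perfplot.show(
    setup=make_df,
    kernels=[
        lambda df: pd.DataFrame(df['val'].tolist()),
        lambda df: df['val'].apply(pd.Series),
        lambda df: pd.json_normalize(df['val']),
    ],
    labels=['pd.DataFrame(tolist)', 'apply(pd.Series)', 'json_normalize'],
    n_range=[2**k for k in range(1, 16)],
    xlabel='number of rows',
    equality_check=None,  # skip result comparison; we only care about timing
)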

Note that I only time the explosion step, since that is the most interesting part of answering this question. Other aspects of constructing the result (such as whether to use pop or drop) are tangential to the discussion and can be ignored (it should be noted, however, that using pop avoids the follow-up drop call, so the final solution performs a bit better, but we are still listifying the column and passing it to pd.DataFrame either way).

Additionally, pop destructively mutates the input DataFrame, which makes it harder to use in benchmarking code that assumes the input is unchanged across test runs.


Critique of other solutions

- df['val'].apply(pd.Series) is extremely slow for large N, because pandas constructs a Series object for each row and then builds a DataFrame from them. For larger N the performance dips to the order of minutes or hours.
- pd.json_normalize(df['val']) is slower simply because json_normalize is meant to work with much more complex input data, particularly deeply nested JSON with multiple record paths and metadata. We have a simple flat dict for which pd.DataFrame suffices, so use that if your dicts are flat.
- Some answers suggest df.pop('val').values.tolist() or df.pop('val').to_numpy().tolist(). I don't think it makes much of a difference whether you listify the Series or the NumPy array. It's one operation less to listify the Series directly, and it really isn't slower, so I'd recommend avoiding generating the NumPy array in the intermediate step.
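For contrast, a short sketch of the nested case where json_normalize does earn its keep (the column name 'val' and the nesting are illustrative):

import pandas as pd

df = pd.DataFrame({'val': [
    {'a': 1, 'meta': {'x': 10, 'y': 20}},
    {'a': 2, 'meta': {'x': 30, 'y': 40}},
]})

# nested keys are flattened into dotted column names
print(pd.json_normalize(df['val']))
#    a  meta.x  meta.y
# 0  1      10      20
# 1  2      30      40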