Split a column of dictionaries into separate columns with pandas

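I have data saved in a PostgreSQL database. I am querying this data using Python 2.7 and turning it into a pandas DataFrame. However, the last column of this DataFrame contains a dictionary of values. The DataFrame df looks like this: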

Station ID     Pollutants
8809           {"a": "46", "b": "3", "c": "12"}
8810           {"a": "36", "b": "5", "c": "8"}
8811           {"b": "2", "c": "7"}
8812           {"c": "11"}
8813           {"a": "82", "c": "15"}

I need to split this column into separate columns, so that the DataFrame df2 looks like this:

Station ID     a      b       c
8809           46     3       12
8810           36     5       8
8811           NaN    2       7
8812           NaN    NaN     11
8813           82     NaN     15

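The main issue I'm having is that the lists are not all the same length. But all of the lists contain at most the same 3 values: 'a', 'b', and 'c'. And they always appear in the same order ('a' first, 'b' second, 'c' third).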

The following code used to work and return exactly what I wanted (df2).

objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
print(df2)

I ran this code just last week and it worked fine. But now my code is broken and I get this error from line [4]:

IndexError: out-of-bounds on slice (end)

I have made no changes to the code, but I am now getting the error. I suspect this is because my method is not robust or proper.

Any suggestions or guidance on how to split this column of lists into separate columns would be much appreciated!

EDIT: I think the .tolist() and .apply methods are not working on my code because the value is one Unicode string, i.e.:

#My data format 
u{'a': '1', 'b': '2', 'c': '3'}

#and not
{u'a': '1', u'b': '2', u'c': '3'}

The data is imported from the PostgreSQL database in this format. Any help or ideas with this issue? Is there a way to convert the Unicode?

Current answer

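my_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['my_col'])

...would parse the dict properly (putting each dict key into a separate df column and the key values into df rows), so the dicts would not get squashed into a single column in the first place.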
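A minimal sketch of this idea, assuming the data can be obtained as a dict keyed by station ID whose values are the pollutant dicts. The contents of `my_dict` are illustrative, taken from the question's sample rows, and the `columns` argument is left out here so that the inner dict keys become the column names:

```python
import pandas as pd

# Hypothetical input: station ID -> pollutant readings (sample values from the question)
my_dict = {
    8809: {"a": "46", "b": "3", "c": "12"},
    8811: {"b": "2", "c": "7"},
    8812: {"c": "11"},
}

# orient='index' makes each outer key a row label and each inner key a column;
# keys missing from a row are filled with NaN
my_df = pd.DataFrame.from_dict(my_dict, orient="index")
my_df.index.name = "Station ID"

print(my_df)
```

This only helps when the DataFrame is built from the raw dicts directly; it does not apply once the dicts are already stored in a single column.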

2020-06-26 10:32:53

Other answers

You can use join with pop + tolist. Performance is comparable to concat with drop + tolist, but some may find this syntax cleaner:

res = df.join(pd.DataFrame(df.pop('b').tolist()))

Benchmarking against the other methods:

df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

def joris1(df):
    return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)

def joris2(df):
    return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)

def jpp(df):
    return df.join(pd.DataFrame(df.pop('b').tolist()))

df = pd.concat([df]*1000, ignore_index=True)

%timeit joris1(df.copy())  # 1.33 s per loop
%timeit joris2(df.copy())  # 7.42 ms per loop
%timeit jpp(df.copy())     # 7.68 ms per loop

2018-08-22 10:25:39

I strongly recommend this method for extracting the 'Pollutants' column:

df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)

It is much faster than

df_pollutants = df['Pollutants'].apply(pd.Series)

when df is large.
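A short sketch of why passing `index=df.index` matters, assuming a `df` whose index is not the default 0..n-1 (the labels below are made up for illustration). Without it, the new frame would get a fresh integer index, and a later `join` or `concat(axis=1)` would not line the rows up:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Station ID": [8809, 8811, 8812],
        "Pollutants": [
            {"a": "46", "b": "3", "c": "12"},
            {"b": "2", "c": "7"},
            {"c": "11"},
        ],
    },
    index=["x", "y", "z"],  # illustrative non-integer index
)

# Build the expanded columns on the same index as df
df_pollutants = pd.DataFrame(df["Pollutants"].values.tolist(), index=df.index)

# Rows are matched by index label, so the join keeps each station with its values
df2 = df.drop(columns=["Pollutants"]).join(df_pollutants)
print(df2)
```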

2019-05-10 08:31:06

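I know the question is quite old, but I got here searching for answers. There is now actually a better (and faster) way of doing this, using json_normalize: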

import pandas as pd

df2 = pd.json_normalize(df['Pollutant Levels'])

This avoids costly apply functions...
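As a rough illustration, `json_normalize` also flattens nested dicts into dotted column names, which building a DataFrame straight from a list of dicts does not do. The nested `meta` key below is invented for the example and is not part of the question's data. The column is passed as a plain list, which sidesteps any differences in how pandas versions handle a Series argument:

```python
import pandas as pd

df = pd.DataFrame({
    "Station ID": [8809, 8811],
    "Pollutant Levels": [
        {"a": "46", "meta": {"unit": "ppm"}},  # 'meta' is a hypothetical nested dict
        {"b": "2"},
    ],
})

# Nested keys are joined with '.', e.g. 'meta.unit'; missing keys become NaN
levels = pd.json_normalize(df["Pollutant Levels"].tolist())
print(levels.columns.tolist())
```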

2019-03-26 11:21:32

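According to the timing analysis performed by Shijith in this answer, the fastest method to normalize a column of flat, one-level dicts is: df.join(pd.DataFrame(df.pop('Pollutants').values.tolist())). It will not resolve the other issues with columns of lists or dicts that are addressed below, such as rows with NaN or nested dicts.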

pd.json_normalize(df.Pollutants) is significantly faster than df.Pollutants.apply(pd.Series). See the %%timeit below. For 1M rows, .json_normalize is 47 times faster than .apply.

Whether reading data from a file, or from an object returned by a database or API, it may not be clear if the dict column has dict or str type. If the dictionaries in the column are str type, they must be converted back to a dict type, using ast.literal_eval or json.loads(…).

Use pd.json_normalize to convert the dicts, with keys as headers and values for rows. There are additional parameters (e.g. record_path & meta) for dealing with nested dicts.

Use pandas.DataFrame.join to combine the original DataFrame, df, with the columns created using pd.json_normalize. If the index isn't integers (as in the example), first use df.reset_index() to get an index of integers, before doing the normalize and join.

pandas.DataFrame.pop is used to remove the specified column from the existing dataframe. This removes the need to drop the column later, using pandas.DataFrame.drop.

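Note: if the column has any NaN, they must be filled with an empty dict: df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index}). If the 'Pollutants' column is strings, use '{}' instead. Also see How to json_normalize a column with NaNs.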

import pandas as pd
from ast import literal_eval
import numpy as np

data = {'Station ID': [8809, 8810, 8811, 8812, 8813, 8814],
        'Pollutants': ['{"a": "46", "b": "3", "c": "12"}', '{"a": "36", "b": "5", "c": "8"}', '{"b": "2", "c": "7"}', '{"c": "11"}', '{"a": "82", "c": "15"}', np.nan]}

df = pd.DataFrame(data)

# display(df)
   Station ID                        Pollutants
0        8809  {"a": "46", "b": "3", "c": "12"}
1        8810   {"a": "36", "b": "5", "c": "8"}
2        8811              {"b": "2", "c": "7"}
3        8812                       {"c": "11"}
4        8813            {"a": "82", "c": "15"}
5        8814                               NaN

# check the type of the first value in Pollutants
>>> print(type(df.iloc[0, 1]))
<class 'str'>

# replace NaN with '{}' if the column is strings, otherwise replace with {}
df.Pollutants = df.Pollutants.fillna('{}')  # if the NaN is in a column of strings
# df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})  # if the column is not strings

# Convert the column of stringified dicts to dicts
# skip this line, if the column contains dicts
df.Pollutants = df.Pollutants.apply(literal_eval)

# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True)  # uncomment if needed

# remove and normalize the column of dictionaries, and join the result to df
df = df.join(pd.json_normalize(df.pop('Pollutants')))

# display(df)
   Station ID    a    b    c
0        8809   46    3   12
1        8810   36    5    8
2        8811  NaN    2    7
3        8812  NaN  NaN   11
4        8813   82  NaN   15
5        8814  NaN  NaN  NaN
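The steps above can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name `expand_dict_column` is made up here, string cells are assumed to be valid JSON (swap in `ast.literal_eval` for Python-repr strings that use single quotes), and any cell that is neither a dict nor a string, such as NaN or None, is treated as an empty dict. It handles flat, one-level dicts only, and it builds the new columns on `df.index` rather than resetting the index:

```python
import json

import pandas as pd


def expand_dict_column(df, column):
    """Return a copy of df with `column` replaced by one column per dict key."""

    def to_dict(value):
        if isinstance(value, dict):
            return value
        if isinstance(value, str):
            return json.loads(value)  # assumes JSON-formatted strings
        return {}  # NaN / None become an empty dict, i.e. an all-NaN row

    parsed = [to_dict(value) for value in df[column]]
    # Reuse df's index so the join below matches rows by label
    expanded = pd.DataFrame(parsed, index=df.index)
    return df.drop(columns=[column]).join(expanded)


sample = pd.DataFrame(
    {
        "Station ID": [8809, 8811, 8814],
        "Pollutants": ['{"a": "46", "b": "3"}', {"b": "2"}, float("nan")],
    },
    index=[10, 20, 30],  # illustrative non-default index
)
result = expand_dict_column(sample, "Pollutants")
print(result)
```

The original `sample` is left unchanged, since the helper uses `drop` rather than `pop`.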

Timing with %%timeit

# larger dataframe for timing: df repeated 20000 times (df must still contain the Pollutants column)
dfb = pd.concat([df]*20000).reset_index(drop=True)

%%timeit
dfb.join(pd.json_normalize(dfb.Pollutants))
[out]:
46.9 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
pd.concat([dfb.drop(columns=['Pollutants']), dfb.Pollutants.apply(pd.Series)], axis=1)
[out]:
7.75 s ± 52.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2020-08-08 02:49:14

df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)

2016-07-07 23:33:56
