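I have data saved in a PostgreSQL database. I am querying this data using Python 2.7 and turning it into a pandas DataFrame. However, the last column of this DataFrame contains a dictionary of values. The DataFrame df looks like this: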
Station ID Pollutants
8809 {"a": "46", "b": "3", "c": "12"}
8810 {"a": "36", "b": "5", "c": "8"}
8811 {"b": "2", "c": "7"}
8812 {"c": "11"}
8813 {"a": "82", "c": "15"}
I need to split this column into separate columns, so that the DataFrame df2 looks like this:
Station ID a b c
8809 46 3 12
8810 36 5 8
8811 NaN 2 7
8812 NaN NaN 11
8813 82 NaN 15
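The main problem I am having is that the lists are not all the same length. However, all of the lists contain at most the same 3 values: 'a', 'b', and 'c'. They also always appear in the same order ('a' first, 'b' second, 'c' third).
The following code used to work and returned exactly what I wanted (df2).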
objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
print(df2)
I ran this code just last week and it worked fine. But now my code is broken and I get this error from line [4]:
IndexError: out-of-bounds on slice (end)
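I have made no changes to the code, but I am now getting the error. I suspect this is because my method is not robust or proper.
Any suggestions or guidance on how to split this column of lists into separate columns would be much appreciated!
EDIT: I think the .tolist() and .apply methods are not working on my code because the value is one Unicode string, i.e.: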
#My data format
u{'a': '1', 'b': '2', 'c': '3'}
#and not
{u'a': '1', u'b': '2', u'c': '3'}
The data is imported from the PostgreSQL database in this format. Any help or ideas with this issue? Is there a way to convert the Unicode?
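To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.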
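Because eval executes whatever the string contains, a more cautious variant of this conversion step uses ast.literal_eval, which only parses Python literals. The following is a minimal sketch, assuming the column holds strings that are valid Python dict literals; the two-row sample frame is made up for illustration and is not the asker's real data.

```python
import ast

import pandas as pd

# Hypothetical sample mimicking the question's column of stringified dicts.
df = pd.DataFrame({
    "Station ID": [8809, 8812],
    "Pollutant Levels": ['{"a": "46", "b": "3", "c": "12"}', '{"c": "11"}'],
})

# literal_eval parses literals only, so it cannot run arbitrary code.
parsed = df["Pollutant Levels"].map(ast.literal_eval)

print(type(parsed.iloc[0]))
print(parsed.tolist())
```

If the strings are strict JSON (double-quoted keys and values), json.loads should work in the same place.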
Using a small example, you can use .apply(pd.Series):
In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})
In [3]: df
Out[3]:
a b
0 1 {u'c': 1}
1 2 {u'd': 3}
2 3 {u'c': 5, u'd': 6}
In [4]: df['b'].apply(pd.Series)
Out[4]:
c d
0 1.0 NaN
1 NaN 3.0
2 5.0 6.0
To combine it with the rest of the DataFrame, you can concat the other columns with the above result:
In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
a c d
0 1 1.0 NaN
1 2 NaN 3.0
2 3 5.0 6.0
Using your code, this also works if I leave out the iloc part:
In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
a c d
0 1 1.0 NaN
1 2 NaN 3.0
2 3 5.0 6.0
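According to the timing analysis performed by Shijith in this answer, the fastest method to normalize a column of flat, one-level dicts is:
df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
It will not resolve other issues with columns of lists or dicts that are addressed below, such as rows with NaN or nested dicts.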
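To illustrate the nested-dict limitation, here is a small sketch comparing the plain DataFrame constructor with pd.json_normalize. The frame, the column name "info", and the nested key "b" are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical data: one level of nesting under the key "b".
df = pd.DataFrame({
    "id": [1, 2],
    "info": [{"a": 1, "b": {"x": 10, "y": 20}}, {"a": 2, "b": {"x": 30}}],
})

# The plain constructor keeps each inner dict as a single object value.
flat_only = pd.DataFrame(df["info"].tolist())

# json_normalize expands nested keys into dotted column names.
nested = pd.json_normalize(df["info"].tolist())

print(flat_only.columns.tolist())
print(nested.columns.tolist())
```

With the constructor, column "b" still holds dicts; with json_normalize, the inner keys become their own columns and missing keys become NaN.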
pd.json_normalize(df.Pollutants) is significantly faster than df.Pollutants.apply(pd.Series)
See the %%timeit below. In that run (46.9 ms vs. 7.75 s per loop), .json_normalize is roughly 165 times faster than .apply.
Whether reading data from a file, or from an object returned by a database, or API, it may not be clear if the dict column has dict or str type.
If the dictionaries in the column are str type, they must be converted back to a dict type, using ast.literal_eval, or json.loads(…).
Use pd.json_normalize to convert the dicts, with keys as headers and values for rows.
There are additional parameters (e.g. record_path & meta) for dealing with nested dicts.
Use pandas.DataFrame.join to combine the original DataFrame, df, with the columns created using pd.json_normalize
If the index isn't integers (as in the example), first use df.reset_index() to get an index of integers, before doing the normalize and join.
pandas.DataFrame.pop is used to remove the specified column from the existing dataframe. This removes the need to drop the column later, using pandas.DataFrame.drop.
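The index caveat above can be sketched as follows. This is an illustration under assumptions: the string index labels 's1' and 's2' and the two-row frame are invented for the example. pd.json_normalize returns a frame with a fresh 0 to n-1 index, so a join against a differently labelled index finds no matching rows.

```python
import pandas as pd

# Hypothetical frame with a non-integer index (made-up station labels).
df = pd.DataFrame(
    {
        "Station ID": [8809, 8811],
        "Pollutants": [{"a": "46", "c": "12"}, {"b": "2"}],
    },
    index=["s1", "s2"],
)

# The normalized frame always gets a default 0..n-1 index.
normalized = pd.json_normalize(df["Pollutants"].tolist())

# Joining directly aligns 's1'/'s2' against 0/1, so no rows match.
misaligned = df.drop(columns="Pollutants").join(normalized)

# Resetting the index first gives both frames the same 0..n-1 labels.
df_reset = df.reset_index()
aligned = df_reset.join(pd.json_normalize(df_reset.pop("Pollutants").tolist()))

print(misaligned)
print(aligned)
```

The old labels survive in the "index" column created by reset_index, so they can be restored afterwards with set_index if needed.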
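Note, if the column has any NaN, they must be filled with an empty dict:
df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})
If the 'Pollutants' column is strings, use '{}' instead.
Also see How to json_normalize a column with NaNs.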
import pandas as pd
from ast import literal_eval
import numpy as np
data = {'Station ID': [8809, 8810, 8811, 8812, 8813, 8814],
'Pollutants': ['{"a": "46", "b": "3", "c": "12"}', '{"a": "36", "b": "5", "c": "8"}', '{"b": "2", "c": "7"}', '{"c": "11"}', '{"a": "82", "c": "15"}', np.nan]}
df = pd.DataFrame(data)
# display(df)
Station ID Pollutants
0 8809 {"a": "46", "b": "3", "c": "12"}
1 8810 {"a": "36", "b": "5", "c": "8"}
2 8811 {"b": "2", "c": "7"}
3 8812 {"c": "11"}
4 8813 {"a": "82", "c": "15"}
5 8814 NaN
# check the type of the first value in Pollutants
>>> print(type(df.iloc[0, 1]))
<class 'str'>
# replace NaN with '{}' if the column is strings, otherwise replace with {}
df.Pollutants = df.Pollutants.fillna('{}') # if the NaN is in a column of strings
# df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index}) # if the column is not strings
# Convert the column of stringified dicts to dicts
# skip this line, if the column contains dicts
df.Pollutants = df.Pollutants.apply(literal_eval)
# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True) # uncomment if needed
# remove and normalize the column of dictionaries, and join the result to df
df = df.join(pd.json_normalize(df.pop('Pollutants')))
# display(df)
Station ID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
5 8814 NaN NaN NaN
%%timeit
# dataframe with 120,000 rows (the 6-row df repeated 20000 times)
dfb = pd.concat([df]*20000).reset_index(drop=True)
%%timeit
dfb.join(pd.json_normalize(dfb.Pollutants))
[out]:
46.9 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.concat([dfb.drop(columns=['Pollutants']), dfb.Pollutants.apply(pd.Series)], axis=1)
[out]:
7.75 s ± 52.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use join with pop + tolist. Performance is comparable to concat with drop + tolist, but some may find this syntax cleaner:
res = df.join(pd.DataFrame(df.pop('b').tolist()))
Benchmarking against the other methods:
df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})
def joris1(df):
return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)
def joris2(df):
return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
def jpp(df):
return df.join(pd.DataFrame(df.pop('b').tolist()))
df = pd.concat([df]*1000, ignore_index=True)
%timeit joris1(df.copy()) # 1.33 s per loop
%timeit joris2(df.copy()) # 7.42 ms per loop
%timeit jpp(df.copy()) # 7.68 ms per loop
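Before comparing timings, it can be worth confirming that the three functions return the same result. The sketch below repeats the definitions above on the small three-row frame and compares the outputs; dtype checking is relaxed as a precaution, on the assumption that the different construction paths could infer slightly different dtypes across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [{'c': 1}, {'d': 3}, {'c': 5, 'd': 6}]})

def joris1(df):
    return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)

def joris2(df):
    return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)

def jpp(df):
    return df.join(pd.DataFrame(df.pop('b').tolist()))

# Each call gets its own copy because jpp pops column 'b' in place.
results = [f(df.copy()) for f in (joris1, joris2, jpp)]

# Raises AssertionError if any result differs from the first one.
for other in results[1:]:
    pd.testing.assert_frame_equal(results[0], other, check_dtype=False)

print(results[0])
```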