我有一个更复杂的情况,数据集有一个嵌套结构:
import json
data = '{"TextID":{"0":"0038f0569e","1":"003eb6998d","2":"006da49ea0"},"Summary":{"0":{"Crisis_Level":["c"],"Type":["d"],"Special_Date":["a"]},"1":{"Crisis_Level":["d"],"Type":["a","d"],"Special_Date":["a"]},"2":{"Crisis_Level":["d"],"Type":["a"],"Special_Date":["a"]}}}'
df = pd.DataFrame.from_dict(json.loads(data))
print(df)
输出:
TextID Summary
0 0038f0569e {'Crisis_Level': ['c'], 'Type': ['d'], 'Specia...
1 003eb6998d {'Crisis_Level': ['d'], 'Type': ['a', 'd'], 'S...
2 006da49ea0 {'Crisis_Level': ['d'], 'Type': ['a'], 'Specia...
Summary列包含dict对象,所以我使用apply和from_dict和stack来提取每一行的dict:
df2 = df.apply(
lambda x: pd.DataFrame.from_dict(x[1], orient='index').stack(), axis=1)
print(df2)
输出:
Crisis_Level Special_Date Type
0 0 0 1
0 c a d NaN
1 d a a d
2 d a a NaN
看起来不错,但缺少TextID列。为了得到TextID列回来,我尝试了三种方法:
Modify apply to return multiple columns:
df_tmp = df.copy()
df_tmp[['TextID', 'Summary']] = df.apply(
lambda x: pd.Series([x[0], pd.DataFrame.from_dict(x[1], orient='index').stack()]), axis=1)
print(df_tmp)
output:
TextID Summary
0 0038f0569e Crisis_Level 0 c
Type 0 d
Spec...
1 003eb6998d Crisis_Level 0 d
Type 0 a
...
2 006da49ea0 Crisis_Level 0 d
Type 0 a
Spec...
But this is not what I want, the Summary structure are flatten.
Use pd.concat:
df_tmp2 = pd.concat([df['TextID'], df2], axis=1)
print(df_tmp2)
output:
TextID (Crisis_Level, 0) (Special_Date, 0) (Type, 0) (Type, 1)
0 0038f0569e c a d NaN
1 003eb6998d d a a d
2 006da49ea0 d a a NaN
Looks fine, the MultiIndex column structure are preserved as tuple. But check columns type:
df_tmp2.columns
output:
Index(['TextID', ('Crisis_Level', 0), ('Special_Date', 0), ('Type', 0),
('Type', 1)],
dtype='object')
Just as a regular Index class, not MultiIndex class.
use set_index:
Turn all columns you want to preserve into row index, after some complicated apply function and then reset_index to get columns back:
df_tmp3 = df.set_index('TextID')
df_tmp3 = df_tmp3.apply(
lambda x: pd.DataFrame.from_dict(x[0], orient='index').stack(), axis=1)
df_tmp3 = df_tmp3.reset_index(level=0)
print(df_tmp3)
output:
TextID Crisis_Level Special_Date Type
0 0 0 1
0 0038f0569e c a d NaN
1 003eb6998d d a a d
2 006da49ea0 d a a NaN
Check the type of columns
df_tmp3.columns
output:
MultiIndex(levels=[['Crisis_Level', 'Special_Date', 'Type', 'TextID'], [0, 1, '']],
codes=[[3, 0, 1, 2, 2], [2, 0, 0, 0, 1]])
因此,如果apply函数将返回MultiIndex列,并且希望保留它,则可能需要尝试第三种方法。