I have a pandas DataFrame with a single column:
import pandas as pd
df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
teams
0 [SF, NYG]
1 [SF, NYG]
2 [SF, NYG]
3 [SF, NYG]
4 [SF, NYG]
5 [SF, NYG]
6 [SF, NYG]
How can I split this column of lists into two columns?
Desired result:
team1 team2
0 SF NYG
1 SF NYG
2 SF NYG
3 SF NYG
4 SF NYG
5 SF NYG
6 SF NYG
I'd like to suggest a more efficient, Pythonic approach.
First, define the DataFrame as in the original post:
df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
My solution:
%%timeit
df['team1'], df['team2'] = zip(*list(df['teams'].values))
>> 761 µs ± 8.35 µs per loop
By comparison, the most upvoted solution:
%%timeit
df[['team1','team2']] = pd.DataFrame(df.teams.tolist(), index=df.index)
df = pd.DataFrame(df['teams'].to_list(), columns=['team1','team2'])
>> 1.31 ms ± 11.2 µs per loop
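My solution takes roughly 40% less time (761 µs versus 1.31 ms) and is much shorter. The only thing you need to remember is how zip(*list) unpacks and transposes a two-dimensional list.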
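To make the zip(*...) step concrete, here is a minimal sketch in plain Python. The `rows` list is a made-up stand-in for `list(df['teams'].values)`, not data from the post. Unpacking the outer list with `*` passes each row to zip as a separate argument, and zip then groups all first elements together, all second elements together, and so on, which amounts to transposing the list.

```python
# Hypothetical stand-in for list(df['teams'].values): one [team1, team2] pair per row.
rows = [["SF", "NYG"], ["SF", "NYG"], ["SF", "NYG"]]

# zip(*rows) is the same as zip(rows[0], rows[1], rows[2]):
# it yields one tuple per "column" of the nested list.
team1, team2 = zip(*rows)

print(team1)  # ('SF', 'SF', 'SF')
print(team2)  # ('NYG', 'NYG', 'NYG')
```

One caveat worth keeping in mind: zip stops at the shortest row, so rows of unequal length are silently truncated.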
Here's another solution, using Series.transform with operator.itemgetter:
>>> from operator import itemgetter
>>> df['teams'].transform({'team1': itemgetter(0), 'team2': itemgetter(1)})
team1 team2
0 SF NYG
1 SF NYG
2 SF NYG
3 SF NYG
4 SF NYG
5 SF NYG
6 SF NYG
This can of course be generalized:
>>> indices = range(len(df['teams'][0]))
>>> df['teams'].transform({f'team{i+1}': itemgetter(i) for i in indices})
team1 team2
0 SF NYG
1 SF NYG
2 SF NYG
3 SF NYG
4 SF NYG
5 SF NYG
6 SF NYG
This approach has the added benefit of letting you extract only the indices you want:
>>> df
teams
0 [SF, NYG, XYZ, ABC]
1 [SF, NYG, XYZ, ABC]
2 [SF, NYG, XYZ, ABC]
3 [SF, NYG, XYZ, ABC]
4 [SF, NYG, XYZ, ABC]
5 [SF, NYG, XYZ, ABC]
6 [SF, NYG, XYZ, ABC]
>>> indices = [0, 2]
>>> df['teams'].transform({f'team{i+1}': itemgetter(i) for i in indices})
team1 team3
0 SF XYZ
1 SF XYZ
2 SF XYZ
3 SF XYZ
4 SF XYZ
5 SF XYZ
6 SF XYZ
A simpler solution:
pd.DataFrame(df2["teams"].to_list(), columns=['team1', 'team2'])
Yields:
team1 team2
-------------
0 SF NYG
1 SF NYG
2 SF NYG
3 SF NYG
4 SF NYG
5 SF NYG
6 SF NYG
7 SF NYG
If you want to split a column of delimited strings rather than lists, you can do something similar:
pd.DataFrame(df["teams"].str.split('<delim>', expand=True).values,
             columns=['team1', 'team2'])
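As a concrete instance of the string case, the sketch below assumes the teams are stored as comma-separated strings. The sample data, the `df_str` name, and the ',' delimiter are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: each row holds "SF,NYG" as a single string.
df_str = pd.DataFrame({"teams": ["SF,NYG"] * 3})

# expand=True makes str.split return a DataFrame with one column per piece.
parts = df_str["teams"].str.split(",", expand=True)

# Rename the default integer column labels (0, 1) to the desired names.
parts.columns = ["team1", "team2"]

print(parts)
```

Renaming the columns on the result of str.split keeps the original index, whereas going through .values and rebuilding the DataFrame resets it to 0..n-1. Which one you want depends on whether the index carries meaning.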
Building on the previous answers, here is another solution that returns the same result as df2.teams.apply(pd.Series) but runs much faster:
pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)
Timing:
In [1]:
import pandas as pd
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2 = pd.concat([df2]*1000).reset_index(drop=True)
In [2]: %timeit df2['teams'].apply(pd.Series)
8.27 s ± 2.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)
35.4 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
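The claim that the comprehension returns the same result as apply(pd.Series) can be checked on a small frame. This is only a sanity-check sketch: the `small` frame is made up, and it compares cell values and column labels, not dtypes or index types.

```python
import pandas as pd

# Hypothetical small frame with the same kind of data as df2.
small = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(5)]})

# Slow variant: builds one Series per row.
slow = small["teams"].apply(pd.Series)

# Fast variant: builds one dict per row, keyed by position in the list.
fast = pd.DataFrame(
    [{x: y for x, y in enumerate(item)} for item in small["teams"].values.tolist()],
    index=small.index,
)

# Compare cell values and column labels of the two results.
same_values = slow.values.tolist() == fast.values.tolist()
same_columns = list(slow.columns) == list(fast.columns)
print(same_values, same_columns)
```

Both variants label the new columns by position (0, 1, ...), so a rename is still needed to get team1 and team2.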