I have a pandas DataFrame with one column:
import pandas as pd
df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]
How can I split this column of lists into two columns?
Expected result:
  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG
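If anyone comes here looking for a ready-made function, I wrote one.
If no columns are specified, it finds every column that contains lists and unfolds it;
the added columns are named column_name_0, column_name_1, and so on;
the column order is preserved in the final DataFrame;
if strict=True, it checks that the lists in a given column are all the same length.
Improvements and comments are appreciated.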
def unfold_columns(df, columns=None, strict=False):
    # Avoid a mutable default argument; None means "detect list columns".
    columns = [] if columns is None else columns
    assert isinstance(columns, list), "Columns should be a list of column names"
    if len(columns) == 0:
        # Pick every column whose values are all lists.
        columns = [
            column for column in df.columns
            if df[column].apply(lambda x: isinstance(x, list)).all()
        ]
    else:
        assert all(column in df.columns for column in columns), \
            "Not all given columns are found in df"
    columns_order = list(df.columns)
    for column_name in columns:
        if df[column_name].apply(lambda x: isinstance(x, list)).all():
            if strict:
                assert df[column_name].apply(len).nunique() == 1, \
                    f"Lists in df['{column_name}'] are not of equal length"
            # Keep the original index so the join below aligns row by row.
            unfolded = pd.DataFrame(df[column_name].tolist(), index=df.index)
            unfolded.columns = [f'{column_name}_{x}' for x in unfolded.columns]
            # Put the new columns where the original column used to be.
            position = columns_order.index(column_name)
            columns_order = [
                *columns_order[:position],
                *unfolded.columns,
                *columns_order[position + 1:]
            ]
            df = df.join(unfolded).drop([column_name], axis=1)
    return df[columns_order]
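The core step the function performs for each column can also be tried on its own. The following is only a minimal sketch, not the function above: the ragged sample data and the variable names are assumptions made for illustration. It shows the column_name_N naming and why the strict check can matter, since pandas pads shorter lists with missing values instead of raising.

```python
import pandas as pd

# Hypothetical data: the second list is shorter than the first.
ragged = pd.DataFrame({"teams": [["SF", "NYG"], ["SF"]]})

# Expand the lists into columns, keeping the original index,
# then prefix the integer column labels with the source column name.
unfolded = pd.DataFrame(ragged["teams"].tolist(), index=ragged.index)
unfolded = unfolded.add_prefix("teams_")

print(list(unfolded.columns))  # ['teams_0', 'teams_1']
# The short row is padded with a missing value rather than raising an error.
print(unfolded["teams_1"].isna().tolist())  # [False, True]

# An explicit length check, similar in spirit to strict=True.
lengths = ragged["teams"].apply(len)
print(lengths.nunique() == 1)  # False
```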
A simpler solution (df2 here is a DataFrame with a teams column like the one in the question):
pd.DataFrame(df2["teams"].to_list(), columns=['team1', 'team2'])
This yields:
team1 team2
-------------
0 SF NYG
1 SF NYG
2 SF NYG
3 SF NYG
4 SF NYG
5 SF NYG
6 SF NYG
7 SF NYG
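The one-liner returns a new frame with a fresh default index that contains only the two team columns. If the original frame has a custom index or other columns to keep, something like the following may help. It is a sketch under assumptions: the index labels and the extra week column are made up for the example.

```python
import pandas as pd

# Hypothetical frame with a non-default index and an extra column.
df = pd.DataFrame(
    {"teams": [["SF", "NYG"], ["LA", "DAL"]], "week": [1, 2]},
    index=["a", "b"],
)

# Passing index=df.index keeps the new rows aligned with the original ones.
split = pd.DataFrame(df["teams"].tolist(), index=df.index,
                     columns=["team1", "team2"])

# Drop the list column and attach the two new columns by index.
result = df.drop(columns="teams").join(split)
print(result)
```

Without index=df.index the new frame would be labelled 0 and 1, and the join would find no matching rows.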
If you want to split a column of delimited strings rather than lists, you can do something similar:
pd.DataFrame(df["teams"].str.split('<delim>', expand=True).values,
columns=['team1', 'team2'])
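As a concrete illustration of the string case, here is a small sketch that assumes a comma as the delimiter; the sample strings are invented for the example.

```python
import pandas as pd

# Hypothetical data: teams stored as comma-separated strings.
df = pd.DataFrame({"teams": ["SF,NYG", "LA,DAL"]})

# expand=True makes str.split return a DataFrame with one column per piece.
parts = df["teams"].str.split(",", expand=True)
parts.columns = ["team1", "team2"]
print(parts)
```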
I would like to recommend a more efficient and more Pythonic way.
First define the DataFrame as in the original post:
df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
My solution:
%%timeit
df['team1'], df['team2'] = zip(*list(df['teams'].values))
>> 761 µs ± 8.35 µs per loop
In comparison, the most upvoted solution:
%%timeit
df[['team1','team2']] = pd.DataFrame(df.teams.tolist(), index=df.index)
df = pd.DataFrame(df['teams'].to_list(), columns=['team1','team2'])
>> 1.31 ms ± 11.2 µs per loop
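My solution takes about 40% less time (761 µs versus 1.31 ms) and the code is much shorter. The only thing you need to remember is how to use zip(*list) to unpack and reshape a two-dimensional list.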
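For readers unfamiliar with the zip(*list) idiom, here is a minimal sketch of what it does. The sample rows are made up for the example.

```python
import pandas as pd

rows = [["SF", "NYG"], ["LA", "DAL"], ["KC", "BUF"]]

# zip(*rows) passes each row as a separate argument to zip, which then
# groups the first items together, the second items together, and so on.
first, second = zip(*rows)
print(first)   # ('SF', 'LA', 'KC')
print(second)  # ('NYG', 'DAL', 'BUF')

# The same idea applied to a DataFrame column of two-element lists.
df = pd.DataFrame({"teams": rows})
df["team1"], df["team2"] = zip(*df["teams"])
print(df[["team1", "team2"]])
```

This assumes every list has exactly two elements. With lists of unequal length, zip truncates to the shortest one, so the unpacking into two names would fail or silently drop values.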