我有一个pandas数据框架,其中一列文本字符串包含逗号分隔的值。我想拆分每个CSV字段,并为每个条目创建一个新行(假设CSV是干净的,只需要在','上拆分)。例如,a应该变成b:

In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [8]: b
Out[8]: 
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

到目前为止,我已经尝试了各种简单的函数,但是.apply方法在轴上使用时似乎只接受一行作为返回值,而且我不能让.transform工作。任何建议都将不胜感激!

示例数据:

from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
               {'var1': 'b', 'var2': 1},
               {'var1': 'c', 'var2': 1},
               {'var1': 'd', 'var2': 2},
               {'var1': 'e', 'var2': 2},
               {'var1': 'f', 'var2': 2}])

我知道这不会起作用,因为我们通过numpy丢失了DataFrame元数据,但它应该给你一个我试图做的感觉:

def fun(row):
    letters = row['var1']
    letters = letters.split(',')
    out = np.array([row] * len(letters))
    out['var1'] = letters
a['idx'] = range(a.shape[0])
z = a.groupby('idx')
z.transform(fun)

当前回答

我有一个类似的问题,我的解决方案是将数据帧转换为字典列表,然后进行转换。函数如下:

import re
import pandas as pd

def separate_row(df, column_name):
    ls = []
    for row_dict in df.to_dict('records'):
        for word in re.split(',', row_dict[column_name]):
            row = row_dict.copy()
            row[column_name]=word
            ls.append(row)
    return pd.DataFrame(ls)

例子:

>>> from pandas import DataFrame
>>> import numpy as np
>>> a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
>>> a
    var1  var2
0  a,b,c     1
1  d,e,f     2
>>> separate_row(a, "var1")
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

您还可以稍微更改该函数以支持分离列表类型行。

其他回答

使用赋值和爆炸的一行代码:

    col1  col2
0  a,b,c     1
1  d,e,f     2

df.assign(col1 = df.col1.str.split(',')).explode('col1', ignore_index=True)

输出:

  col1  col2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

另一种解决方案是使用python复制包

import copy
new_observations = list()
def pandas_explode(df, column_to_explode):
    new_observations = list()
    for row in df.to_dict(orient='records'):
        explode_values = row[column_to_explode]
        del row[column_to_explode]
        if type(explode_values) is list or type(explode_values) is tuple:
            for explode_value in explode_values:
                new_observation = copy.deepcopy(row)
                new_observation[column_to_explode] = explode_value
                new_observations.append(new_observation) 
        else:
            new_observation = copy.deepcopy(row)
            new_observation[column_to_explode] = explode_values
            new_observations.append(new_observation) 
    return_df = pd.DataFrame(new_observations)
    return return_df

df = pandas_explode(df, column_name)

只是从上面使用了jiln的优秀答案,但需要展开以拆分多个列。我想分享一下。

def splitDataFrameList(df,target_column,separator):
''' df = dataframe to split,
target_column = the column containing the values to split
separator = the symbol used to perform the split

returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
The values in the other columns are duplicated across the newly divided rows.
'''
def splitListToRows(row, row_accumulator, target_columns, separator):
    split_rows = []
    for target_column in target_columns:
        split_rows.append(row[target_column].split(separator))
    # Seperate for multiple columns
    for i in range(len(split_rows[0])):
        new_row = row.to_dict()
        for j in range(len(split_rows)):
            new_row[target_columns[j]] = split_rows[j][i]
        row_accumulator.append(new_row)
new_rows = []
df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
new_df = pd.DataFrame(new_rows)
return new_df

我很欣赏“常舍”的回答,真的,但是iterrows()函数在大型数据集上花费很长时间。我面对了这个问题,然后我走到了这一步。

# First, reset_index to make the index a column
a = a.reset_index().rename(columns={'index':'duplicated_idx'})

# Get a longer series with exploded cells to rows
series = pd.DataFrame(a['var1'].str.split('/')
                      .tolist(), index=a.duplicated_idx).stack()

# New df from series and merge with the old one
b = series.reset_index([0, 'duplicated_idx'])
b = b.rename(columns={0:'var1'})

# Optional & Advanced: In case, there are other columns apart from var1 & var2
b.merge(
    a[a.columns.difference(['var1'])],
    on='duplicated_idx')

# Optional: Delete the "duplicated_index"'s column, and reorder columns
b = b[a.columns.difference(['duplicated_idx'])]

我提出了一个具有任意列数的数据框架的解决方案(同时一次仍然只分离一列的条目)。

def splitDataFrameList(df,target_column,separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row,row_accumulator,target_column,separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
    new_df = pandas.DataFrame(new_rows)
    return new_df