我有一个pandas数据框架,其中一列文本字符串包含逗号分隔的值。我想拆分每个CSV字段,并为每个条目创建一个新行(假设CSV是干净的,只需要在','上拆分)。例如,a应该变成b:

In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [8]: b
Out[8]: 
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

到目前为止,我已经尝试了各种简单的函数,但是.apply方法在轴上使用时似乎只接受一行作为返回值,而且我不能让.transform工作。任何建议都将不胜感激!

示例数据:

from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
               {'var1': 'b', 'var2': 1},
               {'var1': 'c', 'var2': 1},
               {'var1': 'd', 'var2': 2},
               {'var1': 'e', 'var2': 2},
               {'var1': 'f', 'var2': 2}])

我知道这不会起作用,因为我们通过numpy丢失了DataFrame元数据,但它应该给你一个我试图做的感觉:

def fun(row):
    letters = row['var1']
    letters = letters.split(',')
    out = np.array([row] * len(letters))
    out['var1'] = letters
a['idx'] = range(a.shape[0])
z = a.groupby('idx')
z.transform(fun)

当前回答

升级了MaxU的答案,支持MultiIndex

def explode(df, lst_cols, fill_value='', preserve_index=False):
    """
    usage:
        In [134]: df
        Out[134]:
           aaa  myid        num          text
        0   10     1  [1, 2, 3]  [aa, bb, cc]
        1   11     2         []            []
        2   12     3     [1, 2]      [cc, dd]
        3   13     4         []            []

        In [135]: explode(df, ['num','text'], fill_value='')
        Out[135]:
           aaa  myid num text
        0   10     1   1   aa
        1   10     1   2   bb
        2   10     1   3   cc
        3   11     2
        4   12     3   1   cc
        5   12     3   2   dd
        6   13     4
    """
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (res.append(df.loc[lens==0, idx_cols], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)

    # if original index is MultiIndex build the dataframe from the multiindex
    # create "exploded" DF
    if isinstance(df.index, pd.MultiIndex):
        res = res.reindex(
            index=pd.MultiIndex.from_tuples(
                res.index,
                names=['number', 'color']
            )
    )
    return res

其他回答

有可能在不改变数据框架结构的情况下拆分和爆炸数据框架

拆分和展开特定列的数据

输入:

    var1    var2
0   a,b,c   1
1   d,e,f   2



#Get the indexes which are repetative with the split 
df['var1'] = df['var1'].str.split(',')
df = df.explode('var1')

Out:

    var1    var2
0   a   1
0   b   1
0   c   1
1   d   2
1   e   2
1   f   2

Edit-1

对多列的行进行拆分和展开

Filename    RGB                                             RGB_type
0   A   [[0, 1650, 6, 39], [0, 1691, 1, 59], [50, 1402...   [r, g, b]
1   B   [[0, 1423, 16, 38], [0, 1445, 16, 46], [0, 141...   [r, g, b]

基于参考列重新索引,并将列值信息与堆栈对齐

df = df.reindex(df.index.repeat(df['RGB_type'].apply(len)))
df = df.groupby('Filename').apply(lambda x:x.apply(lambda y: pd.Series(y.iloc[0])))
df.reset_index(drop=True).ffill()

Out:

                Filename    RGB_type    Top 1 colour    Top 1 frequency Top 2 colour    Top 2 frequency
    Filename                            
 A  0       A   r   0   1650    6   39
    1       A   g   0   1691    1   59
    2       A   b   50  1402    49  187
 B  0       B   r   0   1423    16  38
    1       B   g   0   1445    16  46
    2       B   b   0   1419    16  39

升级了MaxU的答案,支持MultiIndex

def explode(df, lst_cols, fill_value='', preserve_index=False):
    """
    usage:
        In [134]: df
        Out[134]:
           aaa  myid        num          text
        0   10     1  [1, 2, 3]  [aa, bb, cc]
        1   11     2         []            []
        2   12     3     [1, 2]      [cc, dd]
        3   13     4         []            []

        In [135]: explode(df, ['num','text'], fill_value='')
        Out[135]:
           aaa  myid num text
        0   10     1   1   aa
        1   10     1   2   bb
        2   10     1   3   cc
        3   11     2
        4   12     3   1   cc
        5   12     3   2   dd
        6   13     4
    """
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (res.append(df.loc[lens==0, idx_cols], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)

    # if original index is MultiIndex build the dataframe from the multiindex
    # create "exploded" DF
    if isinstance(df.index, pd.MultiIndex):
        res = res.reindex(
            index=pd.MultiIndex.from_tuples(
                res.index,
                names=['number', 'color']
            )
    )
    return res

这样怎么样:

In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))              
                    for _, row in a.iterrows()]).reset_index()
Out[55]: 
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

然后你只需要重命名列

经过痛苦的实验,我找到了比公认的答案更快的方法,我让这个方法起作用了。它在我试用的数据集上运行速度快了大约100倍。

如果有人知道如何使其更优雅,请务必修改我的代码。我找不到一种方法,不设置其他你想保留的列作为下标,然后重设下标,重命名列,但我想还有其他方法可以。

b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()
b = b.reset_index()[[0, 'var2']] # var1 variable is currently labeled 0
b.columns = ['var1', 'var2'] # renaming var1

我有一个类似的问题,我的解决方案是将数据帧转换为字典列表,然后进行转换。函数如下:

import re
import pandas as pd

def separate_row(df, column_name):
    ls = []
    for row_dict in df.to_dict('records'):
        for word in re.split(',', row_dict[column_name]):
            row = row_dict.copy()
            row[column_name]=word
            ls.append(row)
    return pd.DataFrame(ls)

例子:

>>> from pandas import DataFrame
>>> import numpy as np
>>> a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
>>> a
    var1  var2
0  a,b,c     1
1  d,e,f     2
>>> separate_row(a, "var1")
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

您还可以稍微更改该函数以支持分离列表类型行。