我有一个熊猫数据帧df像:

a b
A 1
A 2
B 5
B 5
B 4
C 6

我想按第一列分组,并将第二列作为行中的列表:

A [1,2]
B [5,5,4]
C [6]

是否有可能使用pandas groupby来做这样的事情?


当前回答

有点老了,但我是被指引到这里的。有办法把它按多个不同的列分组吗?

"column1", "column2", "column3"
"foo", "val1", 3
"foo", "val2", 0
"foo", "val2", 3
"bar", "other", 99

:

"column1", "column2", "column3"
"foo", "val1", [ 3 ]
"foo", "val2", [ 0, 3 ]
"bar", "other", [ 99 ]

其他回答

这里我用“|”作为分隔符对元素进行分组

    import pandas as pd

    df = pd.read_csv('input.csv')

    df
    Out[1]:
      Area  Keywords
    0  A  1
    1  A  2
    2  B  5
    3  B  5
    4  B  4
    5  C  6

    df.dropna(inplace =  True)
    df['Area']=df['Area'].apply(lambda x:x.lower().strip())
    print df.columns
    df_op = df.groupby('Area').agg({"Keywords":lambda x : "|".join(x)})

    df_op.to_csv('output.csv')
    Out[2]:
    df_op
    Area  Keywords

    A       [1| 2]
    B    [5| 5| 4]
    C          [6]

根据@EdChum对他的回答的评论来回答。评论是这样的

groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think 

让我们首先创建一个数据框架,第一列中有500k个类别,总df形状为2000万。

df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()
# Sort data by first column 
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)

# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))

# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))
# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']

# Now create final list_b column, using min and max indexes for each category of a and filtering list of b. 
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1)

print(gp_df.shape)
gp_df.head()

上面的代码花费2分钟处理第一列中的2000万行和500k个类别。

我们用df。带有列表和系列构造函数的groupby

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object

要解决一个数据框架的几个列的问题:

In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
   ...: :[3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

这个答案的灵感来自Anamika Modi的回答。谢谢你!

实现这一目标的简便方法是:

df.groupby('a').agg({'b':lambda x: list(x)})

考虑编写自定义聚合:https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py