这显然很简单,但作为一个麻木的新手,我被卡住了。

我有一个CSV文件,其中包含3列,州,办公室ID,以及该办公室的销售。

我想计算给定州中每个办事处的销售额百分比(每个州所有百分比的总和为100%)。

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

这将返回:

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

我似乎不知道如何“达到”集团的州级,通过合计整个州的销售来计算分数。


当前回答

查找跨列或索引的百分比的最优雅方法是使用pd.crosstab。

样本数据

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

输出数据帧是这样的

print(df)

        state   office_id   sales
    0   CA  1   764505
    1   WA  2   313980
    2   CO  3   558645
    3   AZ  4   883433
    4   CA  5   301244
    5   WA  6   752009
    6   CO  1   457208
    7   AZ  2   259657
    8   CA  3   584471
    9   WA  4   122358
    10  CO  5   721845
    11  AZ  6   136928

只需指定要聚合的索引、列和值。normalize关键字将根据上下文计算跨索引或列的%。

result = pd.crosstab(index=df['state'], 
                     columns=df['office_id'], 
                     values=df['sales'], 
                     aggfunc='sum', 
                     normalize='index').applymap('{:.2f}%'.format)




print(result)
office_id   1   2   3   4   5   6
state                       
AZ  0.00%   0.20%   0.00%   0.69%   0.00%   0.11%
CA  0.46%   0.00%   0.35%   0.00%   0.18%   0.00%
CO  0.26%   0.00%   0.32%   0.00%   0.42%   0.00%
WA  0.00%   0.26%   0.00%   0.10%   0.00%   0.63%

其他回答

我认为这一行就可以做到:

df.groupby(['state', 'office_id']).sum().transform(lambda x: x/np.sum(x)*100)

我知道这是一个老问题,但是对于具有大量唯一组的数据集,exp1orer的答案是非常缓慢的(可能是因为lambda)。我建立了他们的答案,把它变成一个数组计算,所以现在它非常快!下面是示例代码:

创建带有50,000个唯一组的测试数据框架

import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)

# This is the total number of groups to be created
NumberOfGroups = 50000

# Create a lot of groups (random strings of 4 letters)
Group1     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/10)]*10
Group2     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/2)]*2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]

# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]

# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
                   'Group 2': Group2,
                   'Final Group': FinalGroup,
                   'Numbers I want as percents': NumbersForPercents})

当分组时,它看起来像:

                             Numbers I want as percents
Group 1 Group 2 Final Group                            
AAAH    AQYR    RMCH                                847
                XDCL                                182
        DQGO    ALVF                                132
                AVPH                                894
        OVGH    NVOO                                650
                VKQP                                857
        VNLY    HYFW                                884
                MOYH                                469
        XOOC    GIDS                                168
                HTOY                                544
AACE    HNXU    RAXK                                243
                YZNK                                750
        NOYI    NYGC                                399
                ZYCI                                614
        QKGK    CRLF                                520
                UXNA                                970
        TXAR    MLNB                                356
                NMFJ                                904
        VQYG    NPON                                504
                QPKQ                                948
...
[50000 rows x 1 columns]

百分比数组法:

# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to it's column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)

这种方法大约需要0.15秒

顶部回答方法(使用lambda函数):

state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))

这种方法需要大约21秒才能产生相同的结果。

结果:

      Group 1 Group 2 Final Group  Numbers I want as percents  Percent of Final Group
0        AAAH    AQYR        RMCH                         847               82.312925
1        AAAH    AQYR        XDCL                         182               17.687075
2        AAAH    DQGO        ALVF                         132               12.865497
3        AAAH    DQGO        AVPH                         894               87.134503
4        AAAH    OVGH        NVOO                         650               43.132050
5        AAAH    OVGH        VKQP                         857               56.867950
6        AAAH    VNLY        HYFW                         884               65.336290
7        AAAH    VNLY        MOYH                         469               34.663710
8        AAAH    XOOC        GIDS                         168               23.595506
9        AAAH    XOOC        HTOY                         544               76.404494

你需要创建第二个groupby对象,按状态分组,然后使用div方法:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100


                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

div中的level='state' kwarg告诉pandas根据索引的状态级别的值广播/加入数据框架。

我使用的简单方法是在2组之后合并,然后做简单的除法。

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])

   state  office_id  sales_x  sales_y  sales_ratio
0     AZ          2   222579  1310725    16.981365
1     AZ          4   252315  1310725    19.250033
2     AZ          6   835831  1310725    63.768601
3     CA          1   405711  2098663    19.331879
4     CA          3   710581  2098663    33.858747
5     CA          5   982371  2098663    46.809373
6     CO          1   404137  1096653    36.851857
7     CO          3   217952  1096653    19.874290
8     CO          5   474564  1096653    43.273852
9     WA          2   535829  1543854    34.707233
10    WA          4   548242  1543854    35.511259
11    WA          6   459783  1543854    29.781508

更新2022 - 03

这个由caner使用变换的答案看起来比我原来的答案要好得多!

df['sales'] / df.groupby('state')['sales'].transform('sum')

感谢Paul Rougieux的评论。

原答案(2014)

Paul H的回答是正确的,您将必须创建第二个groupby对象,但是您可以用更简单的方法计算百分比——只需groupby state_office并将sales列除以它的和。复制Paul H回答的开头:

# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))

返回:

                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508