这显然很简单,但作为一个麻木的新手,我被卡住了。

我有一个CSV文件,其中包含3列,州,办公室ID,以及该办公室的销售。

我想计算给定州中每个办事处的销售额百分比(每个州所有百分比的总和为100%)。

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

这将返回:

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

我似乎不知道如何“达到”集团的州级,通过合计整个州的销售来计算分数。


当前回答

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999)
                         for _ in range(12)]})

grouped = df.groupby(['state', 'office_id'])
100*grouped.sum()/df[["state","sales"]].groupby('state').sum()

返回:

sales
state   office_id   
AZ  2   54.587910
    4   33.009225
    6   12.402865
CA  1   32.046582
    3   44.937684
    5   23.015735
CO  1   21.099989
    3   31.848658
    5   47.051353
WA  2   43.882790
    4   10.265275
    6   45.851935

其他回答

你需要创建第二个groupby对象,按状态分组,然后使用div方法:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100


                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

div中的level='state' kwarg告诉pandas根据索引的状态级别的值广播/加入数据框架。

你可以把整个DataFrame加起来,然后除以状态总数:

# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

df

返回

    office_id   sales state  sales_ratio
0           1  405711    CA     0.193319
1           2  535829    WA     0.347072
2           3  217952    CO     0.198743
3           4  252315    AZ     0.192500
4           5  982371    CA     0.468094
5           6  459783    WA     0.297815
6           1  404137    CO     0.368519
7           2  222579    AZ     0.169814
8           3  710581    CA     0.338587
9           4  548242    WA     0.355113
10          5  474564    CO     0.432739
11          6  835831    AZ     0.637686

但请注意,这只是因为除了state之外的所有列都是数字,从而支持整个DataFrame的总和。例如,如果office_id是字符,你会得到一个错误:

df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

/: 'str'和'str'不支持的操作数类型

我使用的简单方法是在2组之后合并,然后做简单的除法。

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])

   state  office_id  sales_x  sales_y  sales_ratio
0     AZ          2   222579  1310725    16.981365
1     AZ          4   252315  1310725    19.250033
2     AZ          6   835831  1310725    63.768601
3     CA          1   405711  2098663    19.331879
4     CA          3   710581  2098663    33.858747
5     CA          5   982371  2098663    46.809373
6     CO          1   404137  1096653    36.851857
7     CO          3   217952  1096653    19.874290
8     CO          5   474564  1096653    43.273852
9     WA          2   535829  1543854    34.707233
10    WA          4   548242  1543854    35.511259
11    WA          6   459783  1543854    29.781508
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

df.groupby(['state', 'office_id'])['sales'].sum().rename("weightage").groupby(level = 0).transform(lambda x: x/x.sum())
df.reset_index()

输出:

    state   office_id   weightage
0   AZ  2   0.169814
1   AZ  4   0.192500
2   AZ  6   0.637686
3   CA  1   0.193319
4   CA  3   0.338587
5   CA  5   0.468094
6   CO  1   0.368519
7   CO  3   0.198743
8   CO  5   0.432739
9   WA  2   0.347072
10  WA  4   0.355113
11  WA  6   0.297815

df.groupby('state').office_id.value_counts(normalize = True)

我使用value_counts方法,但它返回的百分比像0.70和0.30,而不是像70和30。