这显然很简单,但作为一个麻木的新手,我被卡住了。

我有一个CSV文件,其中包含3列,州,办公室ID,以及该办公室的销售。

我想计算给定州中每个办事处的销售额百分比(每个州所有百分比的总和为100%)。

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

这将返回:

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

我似乎不知道如何“达到”集团的州级,通过合计整个州的销售来计算分数。


当前回答

(此解决方案的灵感来自这篇文章https://pbpython.com/pandas_transform.html)

我发现下面的解决方案是最简单的(可能是最快的)使用转换:

类的简化版本 数据转换时,可以返回一些转换后的完整版本 数据重组。对于这样的转换,输出是相同的 形状作为输入。

所以使用变换,解决方案是一行:

df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')

如果你打印:

print(df.sort_values(['state', 'office_id']).reset_index(drop=True))

   state  office_id   sales          %
0     AZ          2  195197   9.844309
1     AZ          4  877890  44.274352
2     AZ          6  909754  45.881339
3     CA          1  614752  50.415708
4     CA          3  395340  32.421767
5     CA          5  209274  17.162525
6     CO          1  549430  42.659629
7     CO          3  457514  35.522956
8     CO          5  280995  21.817415
9     WA          2  828238  35.696929
10    WA          4  719366  31.004563
11    WA          6  772590  33.298509

其他回答

我认为这一行就可以做到:

df.groupby(['state', 'office_id']).sum().transform(lambda x: x/np.sum(x)*100)

一行的解决方案:

df.join(
    df.groupby('state').agg(state_total=('sales', 'sum')),
    on='state'
).eval('sales / state_total')

这将返回一系列每个办公室的比率——可以单独使用,也可以分配给原始数据框架。

你可以把整个DataFrame加起来,然后除以状态总数:

# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

df

返回

    office_id   sales state  sales_ratio
0           1  405711    CA     0.193319
1           2  535829    WA     0.347072
2           3  217952    CO     0.198743
3           4  252315    AZ     0.192500
4           5  982371    CA     0.468094
5           6  459783    WA     0.297815
6           1  404137    CO     0.368519
7           2  222579    AZ     0.169814
8           3  710581    CA     0.338587
9           4  548242    WA     0.355113
10          5  474564    CO     0.432739
11          6  835831    AZ     0.637686

但请注意,这只是因为除了state之外的所有列都是数字,从而支持整个DataFrame的总和。例如,如果office_id是字符,你会得到一个错误:

df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

/: 'str'和'str'不支持的操作数类型

更新2022 - 03

这个由caner使用变换的答案看起来比我原来的答案要好得多!

df['sales'] / df.groupby('state')['sales'].transform('sum')

感谢Paul Rougieux的评论。

原答案(2014)

Paul H的回答是正确的,您将必须创建第二个groupby对象,但是您可以用更简单的方法计算百分比——只需groupby state_office并将sales列除以它的和。复制Paul H回答的开头:

# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))

返回:

                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

我知道这是一个老问题,但是对于具有大量唯一组的数据集,exp1orer的答案是非常缓慢的(可能是因为lambda)。我建立了他们的答案,把它变成一个数组计算,所以现在它非常快!下面是示例代码:

创建带有50,000个唯一组的测试数据框架

import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)

# This is the total number of groups to be created
NumberOfGroups = 50000

# Create a lot of groups (random strings of 4 letters)
Group1     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/10)]*10
Group2     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/2)]*2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]

# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]

# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
                   'Group 2': Group2,
                   'Final Group': FinalGroup,
                   'Numbers I want as percents': NumbersForPercents})

当分组时,它看起来像:

                             Numbers I want as percents
Group 1 Group 2 Final Group                            
AAAH    AQYR    RMCH                                847
                XDCL                                182
        DQGO    ALVF                                132
                AVPH                                894
        OVGH    NVOO                                650
                VKQP                                857
        VNLY    HYFW                                884
                MOYH                                469
        XOOC    GIDS                                168
                HTOY                                544
AACE    HNXU    RAXK                                243
                YZNK                                750
        NOYI    NYGC                                399
                ZYCI                                614
        QKGK    CRLF                                520
                UXNA                                970
        TXAR    MLNB                                356
                NMFJ                                904
        VQYG    NPON                                504
                QPKQ                                948
...
[50000 rows x 1 columns]

百分比数组法:

# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to it's column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)

这种方法大约需要0.15秒

顶部回答方法(使用lambda函数):

state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))

这种方法需要大约21秒才能产生相同的结果。

结果:

      Group 1 Group 2 Final Group  Numbers I want as percents  Percent of Final Group
0        AAAH    AQYR        RMCH                         847               82.312925
1        AAAH    AQYR        XDCL                         182               17.687075
2        AAAH    DQGO        ALVF                         132               12.865497
3        AAAH    DQGO        AVPH                         894               87.134503
4        AAAH    OVGH        NVOO                         650               43.132050
5        AAAH    OVGH        VKQP                         857               56.867950
6        AAAH    VNLY        HYFW                         884               65.336290
7        AAAH    VNLY        MOYH                         469               34.663710
8        AAAH    XOOC        GIDS                         168               23.595506
9        AAAH    XOOC        HTOY                         544               76.404494