这显然很简单,但作为一个麻木的新手,我被卡住了。
我有一个CSV文件,其中包含3列,州,办公室ID,以及该办公室的销售。
我想计算给定州中每个办事处的销售额百分比(每个州所有百分比的总和为100%)。
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
这将返回:
sales
state office_id
AZ 2 839507
4 373917
6 347225
CA 1 798585
3 890850
5 454423
CO 1 819975
3 202969
5 614011
WA 2 163942
4 369858
6 959285
我似乎不知道如何“达到”集团的州级,通过合计整个州的销售来计算分数。
查找跨列或索引的百分比的最优雅方法是使用pd.crosstab。
样本数据
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
输出数据帧是这样的
print(df)
state office_id sales
0 CA 1 764505
1 WA 2 313980
2 CO 3 558645
3 AZ 4 883433
4 CA 5 301244
5 WA 6 752009
6 CO 1 457208
7 AZ 2 259657
8 CA 3 584471
9 WA 4 122358
10 CO 5 721845
11 AZ 6 136928
只需指定要聚合的索引、列和值。normalize关键字将根据上下文计算跨索引或列的%。
result = pd.crosstab(index=df['state'],
columns=df['office_id'],
values=df['sales'],
aggfunc='sum',
normalize='index').applymap('{:.2f}%'.format)
print(result)
office_id 1 2 3 4 5 6
state
AZ 0.00% 0.20% 0.00% 0.69% 0.00% 0.11%
CA 0.46% 0.00% 0.35% 0.00% 0.18% 0.00%
CO 0.26% 0.00% 0.32% 0.00% 0.42% 0.00%
WA 0.00% 0.26% 0.00% 0.10% 0.00% 0.63%
我知道这是一个老问题,但是对于具有大量唯一组的数据集,exp1orer的答案是非常缓慢的(可能是因为lambda)。我建立了他们的答案,把它变成一个数组计算,所以现在它非常快!下面是示例代码:
创建带有50,000个唯一组的测试数据框架
import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)
# This is the total number of groups to be created
NumberOfGroups = 50000
# Create a lot of groups (random strings of 4 letters)
Group1 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/10)]*10
Group2 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/2)]*2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]
# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]
# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
'Group 2': Group2,
'Final Group': FinalGroup,
'Numbers I want as percents': NumbersForPercents})
当分组时,它看起来像:
Numbers I want as percents
Group 1 Group 2 Final Group
AAAH AQYR RMCH 847
XDCL 182
DQGO ALVF 132
AVPH 894
OVGH NVOO 650
VKQP 857
VNLY HYFW 884
MOYH 469
XOOC GIDS 168
HTOY 544
AACE HNXU RAXK 243
YZNK 750
NOYI NYGC 399
ZYCI 614
QKGK CRLF 520
UXNA 970
TXAR MLNB 356
NMFJ 904
VQYG NPON 504
QPKQ 948
...
[50000 rows x 1 columns]
百分比数组法:
# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to it's column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)
这种方法大约需要0.15秒
顶部回答方法(使用lambda函数):
state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))
这种方法需要大约21秒才能产生相同的结果。
结果:
Group 1 Group 2 Final Group Numbers I want as percents Percent of Final Group
0 AAAH AQYR RMCH 847 82.312925
1 AAAH AQYR XDCL 182 17.687075
2 AAAH DQGO ALVF 132 12.865497
3 AAAH DQGO AVPH 894 87.134503
4 AAAH OVGH NVOO 650 43.132050
5 AAAH OVGH VKQP 857 56.867950
6 AAAH VNLY HYFW 884 65.336290
7 AAAH VNLY MOYH 469 34.663710
8 AAAH XOOC GIDS 168 23.595506
9 AAAH XOOC HTOY 544 76.404494
你需要创建第二个groupby对象,按状态分组,然后使用div方法:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508
div中的level='state' kwarg告诉pandas根据索引的状态级别的值广播/加入数据框架。
我使用的简单方法是在2组之后合并,然后做简单的除法。
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])
state office_id sales_x sales_y sales_ratio
0 AZ 2 222579 1310725 16.981365
1 AZ 4 252315 1310725 19.250033
2 AZ 6 835831 1310725 63.768601
3 CA 1 405711 2098663 19.331879
4 CA 3 710581 2098663 33.858747
5 CA 5 982371 2098663 46.809373
6 CO 1 404137 1096653 36.851857
7 CO 3 217952 1096653 19.874290
8 CO 5 474564 1096653 43.273852
9 WA 2 535829 1543854 34.707233
10 WA 4 548242 1543854 35.511259
11 WA 6 459783 1543854 29.781508
更新2022 - 03
这个由caner使用变换的答案看起来比我原来的答案要好得多!
df['sales'] / df.groupby('state')['sales'].transform('sum')
感谢Paul Rougieux的评论。
原答案(2014)
Paul H的回答是正确的,您将必须创建第二个groupby对象,但是您可以用更简单的方法计算百分比——只需groupby state_office并将sales列除以它的和。复制Paul H回答的开头:
# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
100 * x / float(x.sum()))
返回:
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508