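I am using pandas as a database substitute because I have multiple databases (Oracle, SQL Server, etc.) and I am unable to turn a sequence of commands into a single SQL equivalent.
I have a table loaded in a DataFrame with some columns: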
YEARMONTH, CLIENTCODE, SIZE, etc., etc.
In SQL, counting the number of distinct clients per YEARMONTH would be:
SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;
And the result would be:
201301 5000
201302 13245
How can I do that in pandas?
Here is an approach for getting distinct counts over multiple columns. Let's start with some data:
import pandas as pd

data = {'CLIENT_CODE': [1, 1, 2, 1, 2, 2, 3],
        'YEAR_MONTH': [201301, 201301, 201301, 201302, 201302, 201302, 201302],
        'PRODUCT_CODE': [100, 150, 220, 400, 50, 80, 100]}
table = pd.DataFrame(data)
table
   CLIENT_CODE  YEAR_MONTH  PRODUCT_CODE
0            1      201301           100
1            1      201301           150
2            2      201301           220
3            1      201302           400
4            2      201302            50
5            2      201302            80
6            3      201302           100
Now list the columns of interest and use groupby with a slightly modified syntax:
columns = ['YEAR_MONTH', 'PRODUCT_CODE']
table[columns].groupby(table['CLIENT_CODE']).nunique()
We obtain:
             YEAR_MONTH  PRODUCT_CODE
CLIENT_CODE
1                     2             3
2                     2             3
3                     1             1
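For the single-column count asked about in the question, a minimal sketch on the same sample data might look like the following. The data is rebuilt here so the snippet runs on its own, and the variable name `clients_per_month` is only illustrative.

```python
import pandas as pd

# Same sample data as above, rebuilt so the snippet is self-contained.
table = pd.DataFrame({
    'CLIENT_CODE': [1, 1, 2, 1, 2, 2, 3],
    'YEAR_MONTH': [201301, 201301, 201301, 201302, 201302, 201302, 201302],
    'PRODUCT_CODE': [100, 150, 220, 400, 50, 80, 100],
})

# Rough equivalent of:
#   SELECT count(distinct CLIENT_CODE) FROM table GROUP BY YEAR_MONTH
# (illustrative name, not from the original answer)
clients_per_month = table.groupby('YEAR_MONTH')['CLIENT_CODE'].nunique()
print(clients_per_month)
```

The result is a Series indexed by YEAR_MONTH, with one distinct-client count per month.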
You can now also do this in Python with dplyr-style syntax (here via the datar package):
>>> from datar.all import f, tibble, group_by, summarise, n_distinct
>>>
>>> data = tibble(
... CLIENT_CODE=[1,1,2,1,2,2,3],
... YEAR_MONTH=[201301,201301,201301,201302,201302,201302,201302]
... )
>>>
>>> data >> group_by(f.YEAR_MONTH) >> summarise(n=n_distinct(f.CLIENT_CODE))
   YEAR_MONTH       n
      <int64> <int64>
0      201301       2
1      201302       3
Create a pivot table and use the nunique Series function:
import pandas as pd

ID = [123, 123, 123, 456, 456, 456, 456, 789, 789]
domain = ['vk.com', 'vk.com', 'twitter.com', 'vk.com', 'facebook.com',
          'vk.com', 'google.com', 'twitter.com', 'vk.com']
df = pd.DataFrame({'id': ID, 'domain': domain})
fp = pd.pivot_table(data=df, index='domain', aggfunc=pd.Series.nunique)
print(fp)
Output:
              id
domain
facebook.com   1
google.com     1
twitter.com    2
vk.com         3
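The same counts can also be reached with a plain groupby, which may be easier to read when only one column is involved. The sketch below rebuilds the sample data and compares both approaches; `via_pivot` and `via_groupby` are names chosen for illustration.

```python
import pandas as pd

ID = [123, 123, 123, 456, 456, 456, 456, 789, 789]
domain = ['vk.com', 'vk.com', 'twitter.com', 'vk.com', 'facebook.com',
          'vk.com', 'google.com', 'twitter.com', 'vk.com']
df = pd.DataFrame({'id': ID, 'domain': domain})

# Pivot-table approach from the answer above.
via_pivot = pd.pivot_table(data=df, index='domain', aggfunc=pd.Series.nunique)

# groupby approach: number of distinct ids per domain.
via_groupby = df.groupby('domain')['id'].nunique()

print(via_groupby)
```

Here `via_pivot` is a DataFrame with a single `id` column, while `via_groupby` is a Series; the counts per domain should agree.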
To count the distinct clients and sizes per YEARMONTH (that is, the number of unique values in several columns at once), select a list of columns:
df.groupby('YEARMONTH')[['CLIENTCODE', 'SIZE']].nunique()
The same result can be obtained with SQL syntax on df using pandasql (a module that lets you query pandas DataFrames using SQL syntax).
#! pip install pandasql
from pandasql import sqldf
sqldf("""
SELECT COUNT(DISTINCT CLIENTCODE),
COUNT(DISTINCT SIZE)
FROM df
GROUP BY YEARMONTH
""")
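If installing pandasql is not an option, a similar cross-check can be sketched with Python's built-in sqlite3 module together with `DataFrame.to_sql` and `pd.read_sql_query`. This swaps pandasql for an in-memory SQLite database; the sample `df` and the table name `df_table` are made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    'YEARMONTH': [201301, 201301, 201301, 201302, 201302],
    'CLIENTCODE': [1, 1, 2, 1, 3],
    'SIZE': [10, 20, 10, 30, 30],
})

conn = sqlite3.connect(':memory:')
try:
    # Copy the DataFrame into an in-memory SQLite table (assumed name).
    df.to_sql('df_table', conn, index=False)
    via_sql = pd.read_sql_query(
        """
        SELECT COUNT(DISTINCT CLIENTCODE) AS CLIENTCODE,
               COUNT(DISTINCT SIZE) AS SIZE
        FROM df_table
        GROUP BY YEARMONTH
        ORDER BY YEARMONTH
        """,
        conn,
    )
finally:
    conn.close()

# pandas counterpart of the query above.
via_pandas = df.groupby('YEARMONTH')[['CLIENTCODE', 'SIZE']].nunique()

print(via_sql)
print(via_pandas)
```

Both results should contain the same counts; the pandas version additionally carries YEARMONTH as its index.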
If you want to keep YEARMONTH as a column, as in the following SQL query:
SELECT YEARMONTH,
COUNT(DISTINCT CLIENTCODE),
COUNT(DISTINCT SIZE)
FROM df
GROUP BY YEARMONTH
then set as_index to False in pandas:
df.groupby('YEARMONTH', as_index=False)[['CLIENTCODE', 'SIZE']].nunique()
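To make the effect of as_index visible, here is a small sketch on made-up data. The sample `df` and the names `indexed` and `flat` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    'YEARMONTH': [201301, 201301, 201301, 201302, 201302],
    'CLIENTCODE': [1, 1, 2, 1, 3],
    'SIZE': [10, 20, 10, 30, 30],
})

# Default: the grouping key becomes the index of the result.
indexed = df.groupby('YEARMONTH')[['CLIENTCODE', 'SIZE']].nunique()

# as_index=False: the grouping key stays a regular column.
flat = df.groupby('YEARMONTH', as_index=False)[['CLIENTCODE', 'SIZE']].nunique()

print(indexed.index.name)
print(list(flat.columns))
```

Calling `indexed.reset_index()` is another way to move YEARMONTH from the index back into a column.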
If you need to give custom names to the aggregated columns, as in the following SQL query:
SELECT YEARMONTH,
COUNT(DISTINCT CLIENTCODE) AS `No. clients`,
COUNT(DISTINCT SIZE) AS `No. size`
FROM df
GROUP BY YEARMONTH
use named aggregation in pandas:
(
    df.groupby('YEARMONTH', as_index=False)
      .agg(**{'No. clients': ('CLIENTCODE', 'nunique'),
              'No. size': ('SIZE', 'nunique')})
)
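A runnable version of the named aggregation, on made-up data, might look like this. The sample `df` and the names `summary` and `renamed` are illustrative; the second variant (nunique followed by rename) is an alternative not taken from the answer above.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    'YEARMONTH': [201301, 201301, 201301, 201302, 201302],
    'CLIENTCODE': [1, 1, 2, 1, 3],
    'SIZE': [10, 20, 10, 30, 30],
})

# Named aggregation: output column name -> (input column, aggregation).
# The ** unpacking is needed because the names contain spaces and dots.
summary = (
    df.groupby('YEARMONTH', as_index=False)
      .agg(**{'No. clients': ('CLIENTCODE', 'nunique'),
              'No. size': ('SIZE', 'nunique')})
)

# Alternative sketch: aggregate first, then rename the columns.
renamed = (
    df.groupby('YEARMONTH', as_index=False)[['CLIENTCODE', 'SIZE']]
      .nunique()
      .rename(columns={'CLIENTCODE': 'No. clients', 'SIZE': 'No. size'})
)

print(summary)
```

Named aggregation tends to be the more flexible choice when different columns need different aggregation functions.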