我使用熊猫作为数据库替代品,因为我有多个数据库(Oracle, SQL Server等),我无法使一个SQL等量命令序列。
我有一个表加载在一个DataFrame与一些列:
YEARMONTH, CLIENTCODE, SIZE, etc., etc.
在SQL中,计算每年不同客户端的数量将是:
SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;
结果就是
201301 5000
201302 13245
我如何在熊猫中做到这一点?
To obtain the number of different clients and sizes per year (i.e. number of unique values of multiple columns), use a list of columns:
df.groupby('YEARMONTH')[['CLIENTCODE', 'SIZE']].nunique()
Actually, the result from the above code can be obtained using SQL syntax on df using pandasql (a module built on pandas that lets you query pandas DataFrames using SQL syntax).
#! pip install pandasql
from pandasql import sqldf
sqldf("""
SELECT COUNT(DISTINCT CLIENTCODE),
COUNT(DISTINCT SIZE)
FROM df
GROUP BY YEARMONTH
""")
If you want to keep YEARMONTH as a column, i.e. the analog of the following SQL query
SELECT YEARMONTH,
COUNT(DISTINCT CLIENTCODE),
COUNT(DISTINCT SIZE)
FROM df
GROUP BY YEARMONTH
in pandas is the following (set as_index to False):
df.groupby('YEARMONTH', as_index=False)[['CLIENTCODE', 'SIZE']].nunique()
If you need to set custom names to the aggregated columns, i.e. the analog of the following SQL query:
SELECT YEARMONTH,
COUNT(DISTINCT CLIENTCODE) AS `No. clients`,
COUNT(DISTINCT SIZE) AS `No. size`
FROM df
GROUP BY YEARMONTH
use named aggregation in pandas:
(
df.groupby('YEARMONTH', as_index=False)
.agg(**{'No. clients':('CLIENTCODE', 'nunique'),
'No. size':('SIZE', 'nunique')})
)
现在你也可以在Python中使用dplyr语法来做到这一点:
>>> from datar.all import f, tibble, group_by, summarise, n_distinct
>>>
>>> data = tibble(
... CLIENT_CODE=[1,1,2,1,2,2,3],
... YEAR_MONTH=[201301,201301,201301,201302,201302,201302,201302]
... )
>>>
>>> data >> group_by(f.YEAR_MONTH) >> summarise(n=n_distinct(f.CLIENT_CODE))
YEAR_MONTH n
<int64> <int64>
0 201301 2
1 201302 3