我使用熊猫作为数据库替代品,因为我有多个数据库(Oracle, SQL Server等),我无法使一个SQL等量命令序列。
我有一个表加载在一个DataFrame与一些列:
YEARMONTH, CLIENTCODE, SIZE, etc., etc.
在SQL中,计算每年不同客户端的数量将是:
SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;
结果就是
201301 5000
201302 13245
我如何在熊猫中做到这一点?
To obtain the number of different clients and sizes per year (i.e. number of unique values of multiple columns), use a list of columns:
df.groupby('YEARMONTH')[['CLIENTCODE', 'SIZE']].nunique()
Actually, the result from the above code can be obtained using SQL syntax on df using pandasql (a module built on pandas that lets you query pandas DataFrames using SQL syntax).
#! pip install pandasql
from pandasql import sqldf
sqldf("""
SELECT COUNT(DISTINCT CLIENTCODE),
COUNT(DISTINCT SIZE)
FROM df
GROUP BY YEARMONTH
""")
If you want to keep YEARMONTH as a column, i.e. the analog of the following SQL query
SELECT YEARMONTH,
COUNT(DISTINCT CLIENTCODE),
COUNT(DISTINCT SIZE)
FROM df
GROUP BY YEARMONTH
in pandas is the following (set as_index to False):
df.groupby('YEARMONTH', as_index=False)[['CLIENTCODE', 'SIZE']].nunique()
If you need to set custom names to the aggregated columns, i.e. the analog of the following SQL query:
SELECT YEARMONTH,
COUNT(DISTINCT CLIENTCODE) AS `No. clients`,
COUNT(DISTINCT SIZE) AS `No. size`
FROM df
GROUP BY YEARMONTH
use named aggregation in pandas:
(
df.groupby('YEARMONTH', as_index=False)
.agg(**{'No. clients':('CLIENTCODE', 'nunique'),
'No. size':('SIZE', 'nunique')})
)