我使用熊猫作为数据库替代品,因为我有多个数据库(Oracle, SQL Server等),我无法使一个SQL等量命令序列。
我有一个表加载在一个DataFrame与一些列:
YEARMONTH, CLIENTCODE, SIZE, etc., etc.
在SQL中,计算每年不同客户端的数量将是:
SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;
结果就是
201301 5000
201302 13245
我如何在熊猫中做到这一点?
创建一个数据透视表并使用非唯一级数函数:
ID = [ 123, 123, 123, 456, 456, 456, 456, 789, 789]
domain = ['vk.com', 'vk.com', 'twitter.com', 'vk.com', 'facebook.com',
'vk.com', 'google.com', 'twitter.com', 'vk.com']
df = pd.DataFrame({'id':ID, 'domain':domain})
fp = pd.pivot_table(data=df, index='domain', aggfunc=pd.Series.nunique)
print(fp)
输出:
id
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
To obtain the number of different clients and sizes per year (i.e. number of unique values of multiple columns), use a list of columns:
df.groupby('YEARMONTH')[['CLIENTCODE', 'SIZE']].nunique()
Actually, the result from the above code can be obtained using SQL syntax on df using pandasql (a module built on pandas that lets you query pandas DataFrames using SQL syntax).
#! pip install pandasql
from pandasql import sqldf
sqldf("""
SELECT COUNT(DISTINCT CLIENTCODE),
COUNT(DISTINCT SIZE)
FROM df
GROUP BY YEARMONTH
""")
If you want to keep YEARMONTH as a column, i.e. the analog of the following SQL query
SELECT YEARMONTH,
COUNT(DISTINCT CLIENTCODE),
COUNT(DISTINCT SIZE)
FROM df
GROUP BY YEARMONTH
in pandas is the following (set as_index to False):
df.groupby('YEARMONTH', as_index=False)[['CLIENTCODE', 'SIZE']].nunique()
If you need to set custom names to the aggregated columns, i.e. the analog of the following SQL query:
SELECT YEARMONTH,
COUNT(DISTINCT CLIENTCODE) AS `No. clients`,
COUNT(DISTINCT SIZE) AS `No. size`
FROM df
GROUP BY YEARMONTH
use named aggregation in pandas:
(
df.groupby('YEARMONTH', as_index=False)
.agg(**{'No. clients':('CLIENTCODE', 'nunique'),
'No. size':('SIZE', 'nunique')})
)