熊猫的计数(不同)相当于

我使用熊猫作为数据库替代品，因为我有多个数据库(Oracle, SQL Server等)，我无法使一个SQL等量命令序列。

我有一个表加载在一个DataFrame与一些列:

YEARMONTH, CLIENTCODE, SIZE, etc., etc.

在SQL中，计算每年不同客户端的数量将是:

SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;

结果就是

201301    5000
201302    13245

我如何在熊猫中做到这一点?

当前回答

To obtain the number of different clients and sizes per year (i.e. number of unique values of multiple columns), use a list of columns: df.groupby('YEARMONTH')[['CLIENTCODE', 'SIZE']].nunique() Actually, the result from the above code can be obtained using SQL syntax on df using pandasql (a module built on pandas that lets you query pandas DataFrames using SQL syntax). #! pip install pandasql from pandasql import sqldf sqldf(""" SELECT COUNT(DISTINCT CLIENTCODE), COUNT(DISTINCT SIZE) FROM df GROUP BY YEARMONTH """) If you want to keep YEARMONTH as a column, i.e. the analog of the following SQL query SELECT YEARMONTH, COUNT(DISTINCT CLIENTCODE), COUNT(DISTINCT SIZE) FROM df GROUP BY YEARMONTH in pandas is the following (set as_index to False): df.groupby('YEARMONTH', as_index=False)[['CLIENTCODE', 'SIZE']].nunique() If you need to set custom names to the aggregated columns, i.e. the analog of the following SQL query: SELECT YEARMONTH, COUNT(DISTINCT CLIENTCODE) AS `No. clients`, COUNT(DISTINCT SIZE) AS `No. size` FROM df GROUP BY YEARMONTH use named aggregation in pandas: ( df.groupby('YEARMONTH', as_index=False) .agg(**{'No. clients':('CLIENTCODE', 'nunique'), 'No. size':('SIZE', 'nunique')}) )

2022-08-30 07:57:26

其他回答

我相信这就是你想要的:

table.groupby('YEARMONTH').CLIENTCODE.nunique()

例子:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3

2013-03-14 14:09:06

现在你也可以在Python中使用dplyr语法来做到这一点:

>>> from datar.all import f, tibble, group_by, summarise, n_distinct
>>>
>>> data = tibble(
...     CLIENT_CODE=[1,1,2,1,2,2,3],
...     YEAR_MONTH=[201301,201301,201301,201302,201302,201302,201302]
... )
>>>
>>> data >> group_by(f.YEAR_MONTH) >> summarise(n=n_distinct(f.CLIENT_CODE))
   YEAR_MONTH       n
      <int64> <int64>
0      201301       2
1      201302       3

2021-06-17 02:16:17

使用crosstab，这将返回比groupby nunique更多的信息:

pd.crosstab(df.YEARMONTH,df.CLIENTCODE)
Out[196]:
CLIENTCODE  1  2  3
YEARMONTH
201301      2  1  0
201302      1  2  1

稍加修改后，得到如下结果:

pd.crosstab(df.YEARMONTH,df.CLIENTCODE).ne(0).sum(1)
Out[197]:
YEARMONTH
201301    2
201302    3
dtype: int64

2018-11-23 15:16:20

创建一个数据透视表并使用非唯一级数函数:

ID = [ 123, 123, 123, 456, 456, 456, 456, 789, 789]
domain = ['vk.com', 'vk.com', 'twitter.com', 'vk.com', 'facebook.com',
          'vk.com', 'google.com', 'twitter.com', 'vk.com']
df = pd.DataFrame({'id':ID, 'domain':domain})
fp = pd.pivot_table(data=df, index='domain', aggfunc=pd.Series.nunique)
print(fp)

输出:

               id
domain
facebook.com   1
google.com     1
twitter.com    2
vk.com         3

2021-06-28 14:15:31

我也使用nunique，但如果你必须使用'min'， 'max'， 'count'或'mean'等聚合函数，这将是非常有用的。

df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique') #count(distinct)
df.groupby('YEARMONTH')['CLIENTCODE'].transform('min')     #min
df.groupby('YEARMONTH')['CLIENTCODE'].transform('max')     #max
df.groupby('YEARMONTH')['CLIENTCODE'].transform('mean')    #average
df.groupby('YEARMONTH')['CLIENTCODE'].transform('count')   #count

2019-08-01 09:38:19

熊猫的计数(不同)相当于

推荐文章

最新文章

标签