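Select first row in each GROUP BY group?

As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.

Specifically, if I've got a purchases table that looks like this: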

SELECT * FROM purchases;

My output:

id	customer	total
1	Joe	5
2	Sally	3
3	Joe	2
4	Sally	1
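
For anyone who wants to reproduce the examples in this thread, here is a minimal sketch of the table. The column types and constraints are assumptions, since only the data is shown above.

```sql
-- Assumed column types; the question only shows the rows, not the DDL.
CREATE TABLE purchases (
    id       integer PRIMARY KEY,
    customer text    NOT NULL,
    total    integer NOT NULL
);

-- The four sample rows from the output above.
INSERT INTO purchases (id, customer, total) VALUES
    (1, 'Joe',   5),
    (2, 'Sally', 3),
    (3, 'Joe',   2),
    (4, 'Sally', 1);
```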

I'd like to query for the id of the largest purchase (total) made by each customer. Something like this:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY total DESC;

Expected output:

FIRST(id)	customer	FIRST(total)
1	Joe	5
2	Sally	3

Current answer

On databases that support CTEs and window functions:

WITH summary AS (
    SELECT p.id, 
           p.customer, 
           p.total, 
           -- alias is "rn" rather than "rank": RANK is a reserved word in some databases (e.g. MySQL 8)
           ROW_NUMBER() OVER(PARTITION BY p.customer 
                                 ORDER BY p.total DESC) AS rn
      FROM PURCHASES p)
 SELECT *
   FROM summary
 WHERE rn = 1;

Supported by any database:

But you need to add logic to break ties:

  SELECT MIN(x.id),  -- change to MAX if you want the highest
         x.customer, 
         x.total
    FROM PURCHASES x
    JOIN (SELECT p.customer,
                 MAX(total) AS max_total
            FROM PURCHASES p
        GROUP BY p.customer) y ON y.customer = x.customer
                              AND y.max_total = x.total
GROUP BY x.customer, x.total

2010-09-27 01:27:54

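Other answers

This is a common greatest-n-per-group problem, which already has well-tested and highly optimized solutions. Personally, I prefer the left join solution by Bill Karwin (the original post has lots of other solutions).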
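
The left join idea is not spelled out in this answer, so here is a minimal sketch of it against the purchases table from the question. It is my own rendering of the general technique, not a copy of the linked post; the tie-break on id is an assumption about which row should win when totals are equal.

```sql
-- Keep a row p1 only if no "better" row p2 exists for the same customer.
-- "Better" means a higher total, or the same total with a lower id (assumed tie-break).
SELECT p1.id, p1.customer, p1.total
FROM   purchases p1
LEFT   JOIN purchases p2
       ON  p2.customer = p1.customer
       AND (p2.total > p1.total
            OR (p2.total = p1.total AND p2.id < p1.id))
WHERE  p2.id IS NULL;   -- no better row found: p1 is the top row of its group
```

Because the join condition is a self-comparison within each group, this approach tends to benefit from an index on (customer, total), and it can get expensive when groups are large.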

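Note that a whole series of solutions to this common problem can, surprisingly, be found in the MySQL manual. Even though your question is about Postgres, not MySQL, the solutions given there should work with most SQL variants. See Examples of Common Queries :: The Rows Holding the Group-wise Maximum of a Certain Column.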

2013-06-27 08:38:44

For SQL Server, the most efficient way is:

with
ids as ( --condition for split table into groups
    select i from (values (9),(12),(17),(18),(19),(20),(22),(21),(23),(10)) as v(i) 
) 
,src as ( 
    select * from yourTable where  <condition> --use this as filter for other conditions
)
,joined as (
    select tops.* from ids 
    cross apply --acts like a "for each row" loop over ids
    (
        select top(1) * 
        from src
        where CommodityId = ids.i 
    ) as tops
)
select * from joined

Don't forget to create a clustered index on the columns used.
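
As a rough sketch of that last step, assuming CommodityId is the grouping column as in the query above (the index name is hypothetical):

```sql
-- Hypothetical index name; yourTable and CommodityId are the placeholders used in the query above.
CREATE CLUSTERED INDEX ix_yourTable_CommodityId
    ON yourTable (CommodityId);
```

A table can have only one clustered index, so if the primary key already occupies it, a nonclustered index on the same column is the alternative.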

2019-01-18 10:59:03

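Benchmarks

I tested the most interesting candidates:

Initially with Postgres 9.4 and 9.5. More pronounced tests for Postgres 13 were added later.

Basic test setup

Main table: purchases: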

CREATE TABLE purchases (
  id          serial  -- PK constraint added below
, customer_id int     -- REFERENCES customer
, total       int     -- could be amount of money in Cent
, some_column text    -- to make the row bigger, more realistic
);

Dummy data (with some dead tuples), PK, index:

INSERT INTO purchases (customer_id, total, some_column)    -- 200k rows
SELECT (random() * 10000)::int             AS customer_id  -- 10k distinct customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9;  -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;

customer table, used for the optimized query:

CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM   purchases
GROUP  BY 1
ORDER  BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;

In a second test for 9.5 I used the same setup, but with 100000 distinct customer_id values to get only a few rows per customer_id.
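
The modified statement for that second test is not shown. A plausible sketch, assuming only the multiplier for customer_id changes and everything else stays as in the basic setup:

```sql
-- Assumption: same INSERT as in the basic setup, with the customer_id range widened to 0..100000.
INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 100000)::int            AS customer_id  -- up to ~100k distinct customers
     , (random() * random() * 100000)::int AS total
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;
```

Drawing 200k random values from that range leaves many ids unused, which would be consistent with the 86446 distinct customers reported for this test.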

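Object sizes for table purchases

Basic setup: 200k rows in purchases, 10k distinct customer_id, avg. 20 rows per customer. For Postgres 9.5 I added a second test with 86446 distinct customers, avg. 2.3 rows per customer.

Generated with a query taken from here:

Measure the size of a PostgreSQL table row

Gathered for Postgres 9.5: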

               what                | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
 core_relation_size                | 20496384 | 20 MB        |           102
 visibility_map                    |        0 | 0 bytes      |             0
 free_space_map                    |    24576 | 24 kB        |             0
 table_size_incl_toast             | 20529152 | 20 MB        |           102
 indexes_size                      | 10977280 | 10 MB        |            54
 total_size_incl_toast_and_indexes | 31506432 | 30 MB        |           157
 live_rows_in_text_representation  | 13729802 | 13 MB        |            68
 ------------------------------    |          |              |
 row_count                         |   200045 |              |
 live_tuples                       |   200045 |              |
 dead_tuples                       |    19955 |              |

Queries

1. row_number() in CTE (see other answer)

WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;

2. row_number() in subquery (my optimization)

SELECT id, customer_id, total
FROM   (
   SELECT id, customer_id, total
        , row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;

3. DISTINCT ON (see other answer)

SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;

4. rCTE with LATERAL subquery (see here)

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;

5. customer table with LATERAL (see here)

SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;

6. array_agg() with ORDER BY (see other answer)

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;

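Results

Execution times for the queries above, measured with EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF); best of 5 runs, with a warm cache.

All queries used an Index Only Scan on purchases2_3c_idx (among other steps). Some only to benefit from the smaller size of the index, others more effectively.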
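
As an illustration of how such a measurement can be taken, here is query 3 wrapped in EXPLAIN with the options named above. That each query was run in exactly this form is an assumption.

```sql
-- ANALYZE executes the query; TIMING OFF skips per-node timing overhead,
-- COSTS OFF hides planner estimates. Total execution time is still reported.
EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;
```

Running it several times and keeping the lowest reported execution time gives a warm-cache figure; whether an Index Only Scan shows up in the plan depends on the index and on the visibility map being up to date.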

A. Postgres 9.4 with 200k rows and ~ 20 rows per customer_id

1. 273.274 ms  
2. 194.572 ms  
3. 111.067 ms  
4.  92.922 ms  -- !
5.  37.679 ms  -- winner
6. 189.495 ms

B. Same as A., with Postgres 9.5

1. 288.006 ms
2. 223.032 ms  
3. 107.074 ms  
4.  78.032 ms  -- !
5.  33.944 ms  -- winner
6. 211.540 ms

C. Same as B., but with ~ 2.3 rows per customer_id

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms

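Re-test with Postgres 13 on 2021-08-11

Simplified test setup: no deleted rows, because VACUUM ANALYZE cleans the table completely for the simple case.

Important changes for Postgres:

General performance improvements. CTEs can be inlined since Postgres 12, so queries 1. and 2. now perform mostly identically (same query plan).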
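
To see the difference that inlining makes, the CTE can be forced back to the old behavior with the MATERIALIZED keyword available since Postgres 12. This variant of query 1 is an illustrative sketch and was not part of the benchmark:

```sql
-- MATERIALIZED computes the CTE separately, as Postgres 11 and older always did.
-- NOT MATERIALIZED (or no keyword, for a side-effect-free CTE referenced once)
-- lets the planner inline it like the subquery in query 2.
WITH cte AS MATERIALIZED (
   SELECT id, customer_id, total
        , row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;
```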

D. Like B., ~ 20 rows per customer_id

1. 103 ms
2. 103 ms  
3.  23 ms  -- winner  
4.  71 ms  
5.  22 ms  -- winner
6.  81 ms

db<>fiddle here

E. Like C., ~ 2.3 rows per customer_id

1. 127 ms
2. 126 ms  
3.  36 ms  -- winner  
4. 620 ms  
5. 145 ms
6. 203 ms

db<>fiddle here

More pronounced tests with Postgres 13

1M rows; 10,000 vs. 100 vs. 1.6 rows per customer.

F. ~ 10,000 rows per customer

1. 526 ms
2. 527 ms  
3. 127 ms
4.   2 ms  -- winner !
5.   1 ms  -- winner !
6. 356 ms

db<>fiddle here

G. ~ 100 rows per customer

1. 535 ms
2. 529 ms  
3. 132 ms
4. 108 ms  -- !
5.  71 ms  -- winner
6. 376 ms

db<>fiddle here

H. ~ 1.6 rows per customer

1.  691 ms
2.  684 ms  
3.  234 ms  -- winner
4. 4669 ms
5. 1089 ms
6. 1264 ms

db<>fiddle here

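Conclusions

DISTINCT ON uses the index effectively and typically performs best for few rows per group. It performs decently even with many rows per group.

For many rows per group, emulating an index skip scan with an rCTE performs best, second only to the query technique with a separate lookup table (if that is available).

The row_number() technique demonstrated in the currently accepted answer never wins any performance test. Not then, not now. It never comes close to DISTINCT ON, not even when the data distribution is unfavorable for the latter. The only good thing about row_number(): it does not scale terribly, just mediocre.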
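
Applied to the table from the original question, where the grouping column is customer rather than customer_id, the DISTINCT ON variant would look roughly like this. DISTINCT ON is a PostgreSQL extension and is not available in most other databases.

```sql
-- One row per customer: the first row in ORDER BY order within each group.
-- The leading ORDER BY expressions must match the DISTINCT ON expressions.
SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;   -- id breaks ties between equal totals
```

With the sample data this returns the rows with id 1 (Joe, 5) and id 2 (Sally, 3).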

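More benchmarks

Benchmark by "ogr" with 10 million rows and 600,000 unique "customers" on Postgres 11.5. Results are in line with what we have seen so far:

Proper way to access latest row for each individual identifier?

Original (outdated) benchmark from 2011

I ran three tests with PostgreSQL 9.1 on a real-life table of 65579 rows with single-column btree indexes on each of the three columns involved, and took the best execution time of 5 runs. Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):

Select the whole table, which results in 5958 rows in this case.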

A: 567.218 ms
B: 386.673 ms

Use the condition WHERE customer BETWEEN x AND y, resulting in 1000 rows.

A: 249.136 ms
B:  55.111 ms

Select a single customer with WHERE customer = x.

A:   0.143 ms
B:   0.072 ms

Same tests repeated with the index described in the other answer:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

1A: 277.953 ms  
1B: 193.547 ms

2A: 249.796 ms -- special index not used  
2B:  28.679 ms

3A:   0.120 ms  
3B:   0.048 ms

2016-01-11 06:05:43

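My approach via window functions (dbfiddle):

Assign a row number within each group: row_number() over (partition by agreement_id, order_id) as nrow
Take only the first row of each group: filter (where nrow = 1)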

with intermediate as (select 
 *,
 row_number() over ( partition by agreement_id, order_id ) as nrow,
 (sum( suma ) over ( partition by agreement_id, order_id ))::numeric( 10, 2) as order_suma
from <your table>)

select 
  *,
  sum( order_suma ) filter (where nrow = 1) over (partition by agreement_id)
from intermediate

2021-05-13 13:18:49

As Erwin pointed out, this solution is not very efficient because of the subquery:

select * from purchases p1 where total in
(select max(total) from purchases where p1.customer=customer) order by total desc;

2013-06-17 18:02:04
