基准
我测试了最有趣的候选人:
最初使用Postgres 9.4和9.5。后来为Postgres 13添加了重音测试。
基本测试设置
主表:采购:
CREATE TABLE purchases (
id serial -- PK constraint added below
, customer_id int -- REFERENCES customer
, total int -- could be amount of money in Cent
, some_column text -- to make the row bigger, more realistic
);
虚拟数据(带有一些死元组),PK,索引:
INSERT INTO purchases (customer_id, total, some_column) -- 200k rows
SELECT (random() * 10000)::int AS customer_id -- 10k distinct customers
, (random() * random() * 100000)::int AS total
, 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM generate_series(1,200000) g;
ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);
DELETE FROM purchases WHERE random() > 0.9; -- some dead rows
INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int AS customer_id -- 10k customers
, (random() * random() * 100000)::int AS total
, 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM generate_series(1,20000) g; -- add 20k to make it ~ 200k
CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);
VACUUM ANALYZE purchases;
客户表-用于优化查询:
CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM purchases
GROUP BY 1
ORDER BY 1;
ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);
VACUUM ANALYZE customer;
在9.5的第二次测试中,我使用了相同的设置,但使用了100000个不同的customer_id来获得每个customer_id的几行。
表购买的对象大小
基本设置:购买200k行,10k个不同的customer_id,平均每个客户20行。对于Postgres9.5,我添加了第二个测试,共有86446个不同的客户-平均每个客户2.3行。
使用此处的查询生成:
测量PostgreSQL表行的大小
为Postgres 9.5收集:
what | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
core_relation_size | 20496384 | 20 MB | 102
visibility_map | 0 | 0 bytes | 0
free_space_map | 24576 | 24 kB | 0
table_size_incl_toast | 20529152 | 20 MB | 102
indexes_size | 10977280 | 10 MB | 54
total_size_incl_toast_and_indexes | 31506432 | 30 MB | 157
live_rows_in_text_representation | 13729802 | 13 MB | 68
------------------------------ | | |
row_count | 200045 | |
live_tuples | 200045 | |
dead_tuples | 19955 | |
查询
1.CTE中的row_number(),(参见其他答案)
WITH cte AS (
SELECT id, customer_id, total
, row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
FROM purchases
)
SELECT id, customer_id, total
FROM cte
WHERE rn = 1;
2.子查询中的row_number()(我的优化)
SELECT id, customer_id, total
FROM (
SELECT id, customer_id, total
, row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
FROM purchases
) sub
WHERE rn = 1;
3.DISTINCT ON(请参阅其他答案)
SELECT DISTINCT ON (customer_id)
id, customer_id, total
FROM purchases
ORDER BY customer_id, total DESC, id;
4.带有LATERAL子查询的rCTE(参见此处)
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT id, customer_id, total
FROM purchases
ORDER BY customer_id, total DESC
LIMIT 1
)
UNION ALL
SELECT u.*
FROM cte c
, LATERAL (
SELECT id, customer_id, total
FROM purchases
WHERE customer_id > c.customer_id -- lateral reference
ORDER BY customer_id, total DESC
LIMIT 1
) u
)
SELECT id, customer_id, total
FROM cte
ORDER BY customer_id;
5.带LATERAL的客户表(见此处)
SELECT l.*
FROM customer c
, LATERAL (
SELECT id, customer_id, total
FROM purchases
WHERE customer_id = c.customer_id -- lateral reference
ORDER BY total DESC
LIMIT 1
) l;
6.带有ORDER BY的array_agg()(请参见其他答案)
SELECT (array_agg(id ORDER BY total DESC))[1] AS id
, customer_id
, max(total) AS total
FROM purchases
GROUP BY customer_id;
后果
使用EXPLAIN(ANALYZE、TIMING OFF、COSTS OFF)执行上述查询的执行时间,与热缓存相比,这是5次运行中最好的一次。
所有查询都在purchases2_3c_idx上使用了“仅索引扫描”(以及其他步骤)。有些人只是为了从较小的指数规模中获益,其他人则更有效。
A.Postgres 9.4,200k行,每个customer_id约20行
1. 273.274 ms
2. 194.572 ms
3. 111.067 ms
4. 92.922 ms -- !
5. 37.679 ms -- winner
6. 189.495 ms
B.与A.相同,Postgres 9.5
1. 288.006 ms
2. 223.032 ms
3. 107.074 ms
4. 78.032 ms -- !
5. 33.944 ms -- winner
6. 211.540 ms
C.与B.相同,但每个customer_id有约2.3行
1. 381.573 ms
2. 311.976 ms
3. 124.074 ms -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms
2021-08-11年与Postgres 13重新测试
简化的测试设置:没有删除的行,因为VACUUM ANALYZE完全清除了简单情况下的表。
Postgres的重要变化:
一般性能改进。CTE可以从Postgres 12开始内联,因此查询1。和2。现在执行基本相同的(相同的查询计划)。
D.类似B.每个customer_id约20行
1. 103 ms
2. 103 ms
3. 23 ms -- winner
4. 71 ms
5. 22 ms -- winner
6. 81 ms
db<>fiddle在这里
E.类似C.每个customer_id约2.3行
1. 127 ms
2. 126 ms
3. 36 ms -- winner
4. 620 ms
5. 145 ms
6. 203 ms
db<>fiddle在这里
参加Postgres 13考试
每个客户有1百万行,10.000对100对1.6行。
F.每个客户约10.000行
1. 526 ms
2. 527 ms
3. 127 ms
4. 2 ms -- winner !
5. 1 ms -- winner !
6. 356 ms
db<>fiddle在这里
G.每个客户约100行
1. 535 ms
2. 529 ms
3. 132 ms
4. 108 ms -- !
5. 71 ms -- winner
6. 376 ms
db<>fiddle在这里
H.每个客户约1.6行
1. 691 ms
2. 684 ms
3. 234 ms -- winner
4. 4669 ms
5. 1089 ms
6. 1264 ms
db<>fiddle在这里
结论
DISTINCT ON有效地使用索引,并且通常对每个组的几行执行最佳。即使每个组有很多行,它的性能也很好。对于每个组的许多行,使用rCTE模拟索引跳过扫描的性能最好,仅次于使用单独查找表(如果可用)的查询技术。在当前接受的答案中演示的row_number()技术从未赢得任何性能测试。那时不行,现在不行。它从未接近DISTINCT ON,即使数据分布对后者不利。row_number()唯一的优点是:它的扩展性不是很好,只是一般。
更多基准
在Postgres11.5上通过“ogr”进行基准测试,拥有1000万行和60万个独特的“客户”。结果与我们目前所看到的一致:
访问每个标识符的最新行的正确方法?
2011年原始(过时)基准
我用PostgreSQL 9.1在一个真实的表上运行了三次测试,该表包含65579行,三列中的每一列都有单列btree索引,并用了5次运行的最佳执行时间。将@OMGPonies的第一个查询(A)与上述DISTINCT ON解决方案(B)进行比较:
选择整个表,在本例中会产生5958行。
A: 567.218 ms
B: 386.673 ms
使用条件WHERE customer BETWEEN x AND y,得到1000行。
A: 249.136 ms
B: 55.111 ms
选择WHERE客户=x的单个客户。
A: 0.143 ms
B: 0.072 ms
用另一个答案中描述的索引重复相同的测试:
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
1A: 277.953 ms
1B: 193.547 ms
2A: 249.796 ms -- special index not used
2B: 28.679 ms
3A: 0.120 ms
3B: 0.048 ms