我想在PostgreSQL中随机选择行,我尝试了这个:
select * from table where random() < 0.01;
但也有人建议:
select * from table order by random() limit 1000;
我有一个非常大的表,有5亿行,我希望它是快速的。
哪种方法更好?有什么不同?选择随机行最好的方法是什么?
我想在PostgreSQL中随机选择行,我尝试了这个:
select * from table where random() < 0.01;
但也有人建议:
select * from table order by random() limit 1000;
我有一个非常大的表,有5亿行,我希望它是快速的。
哪种方法更好?有什么不同?选择随机行最好的方法是什么?
当前回答
Postgresql order by random(),按随机顺序选择行:
这是缓慢的,因为它对整个表进行排序,以保证每一行都有完全相等的机会被选中。全表扫描对于完美的随机性是不可避免的。
select your_columns from your_table ORDER BY random()
Postgresql order by random() with distinct:
select * from
(select distinct your_columns from your_table) table_alias
ORDER BY random()
Postgresql顺序随机限制一行:
这也很慢,因为它必须扫描表,以确保每一行都有相同的机会被选中,就在这一刻:
select your_columns from your_table ORDER BY random() limit 1
常数时间选择随机N行元素周期表扫描:
如果您的表非常大,那么上面的表扫描就需要花费5分钟才能完成。
为了更快,你可以安排一个幕后的夜间表扫描驯鹿,这将保证一个O(1)恒定时间速度的完美随机选择,除了在夜间索引表扫描期间,在你可能收到另一个随机行之前,它必须等待维护完成。
--Create a demo table with lots of random nonuniform data, big_data
--is your huge table you want to get random rows from in constant time.
drop table if exists big_data;
CREATE TABLE big_data (id serial unique, some_data text );
CREATE INDEX ON big_data (id);
--Fill it with a million rows which simulates your beautiful data:
INSERT INTO big_data (some_data) SELECT md5(random()::text) AS some_data
FROM generate_series(1,10000000);
--This delete statement puts holes in your index
--making it NONuniformly distributed
DELETE FROM big_data WHERE id IN (2, 4, 6, 7, 8);
--Do the nightly maintenance task on a schedule at 1AM.
drop table if exists big_data_mapper;
CREATE TABLE big_data_mapper (id serial, big_data_id int);
CREATE INDEX ON big_data_mapper (id);
CREATE INDEX ON big_data_mapper (big_data_id);
INSERT INTO big_data_mapper(big_data_id) SELECT id FROM big_data ORDER BY id;
--We have to use a function because the big_data_mapper might be out-of-date
--in between nightly tasks, so to solve the problem of a missing row,
--you try again until you succeed. In the event the big_data_mapper
--is broken, it tries 25 times then gives up and returns -1.
CREATE or replace FUNCTION get_random_big_data_id()
RETURNS int language plpgsql AS $$
declare
response int;
BEGIN
--Loop is required because big_data_mapper could be old
--Keep rolling the dice until you find one that hits.
for counter in 1..25 loop
SELECT big_data_id
FROM big_data_mapper OFFSET floor(random() * (
select max(id) biggest_value from big_data_mapper
)
) LIMIT 1 into response;
if response is not null then
return response;
end if;
end loop;
return -1;
END;
$$;
--get a random big_data id in constant time:
select get_random_big_data_id();
--Get 1 random row from big_data table in constant time:
select * from big_data where id in (
select get_random_big_data_id() from big_data limit 1
);
┌─────────┬──────────────────────────────────┐
│ id │ some_data │
├─────────┼──────────────────────────────────┤
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │
└─────────┴──────────────────────────────────┘
--Get 4 random rows from big_data in constant time:
select * from big_data where id in (
select get_random_big_data_id() from big_data limit 3
);
┌─────────┬──────────────────────────────────┐
│ id │ some_data │
├─────────┼──────────────────────────────────┤
│ 2722848 │ fab6a7d76d9637af89b155f2e614fc96 │
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │
│ 9475611 │ 36ac3eeb6b3e171cacd475e7f9dade56 │
└─────────┴──────────────────────────────────┘
--Test what happens when big_data_mapper stops receiving
--nightly reindexing.
delete from big_data_mapper where 1=1;
select get_random_big_data_id(); --It tries 25 times, and returns -1
--which means wait N minutes and try again.
改编自:https://www.gab.lc/articles/bigdata_postgresql_order_by_random
或者,如果以上都是太多的工作。
A simpler good 'nuff solution for constant time select random row is to make a new column on your big table called big_data.mapper_int make it not null with a unique index. Every night reset the column with a unique integer between 1 and max(n). To get a random row you "choose a random integer between 0 and max(id)" and return the row where mapper_int is that. If there's no row by that id, because the row has changed since re-index, choose another random row. If a row is added to big_data.mapper_int then populate it with max(id) + 1
其他回答
如果只需要一行,可以使用从count派生的计算偏移量。
select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));
快速的方式
根据您的规格(以及评论中的其他信息),
您有一个数字ID列(整数),只有很少(或中等数量)的间隙。 显然没有或很少有写操作。 您的ID列必须被索引!主键很有用。
下面的查询不需要对大表进行顺序扫描,只需要进行索引扫描。
首先,获取主查询的估计值:
SELECT count(*) AS ct -- optional
, min(id) AS min_id
, max(id) AS max_id
, max(id) - min(id) AS id_span
FROM big;
唯一可能昂贵的部分是count(*)(对于巨大的表)。考虑到上述规格,您不需要它。用一个估计来代替完整的计数就可以了,几乎不需要任何成本:
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS ct
FROM pg_class
WHERE oid = 'big'::regclass; -- your table name
详细解释:
在PostgreSQL中快速发现表的行数
只要ct不比id_span小很多,查询的性能就优于其他方法。
WITH params AS (
SELECT 1 AS min_id -- minimum id <= current min id
, 5100000 AS id_span -- rounded up. (max_id - min_id + buffer)
)
SELECT *
FROM (
SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
FROM params p
, generate_series(1, 1100) g -- 1000 + buffer
GROUP BY 1 -- trim duplicates
) r
JOIN big USING (id)
LIMIT 1000; -- trim surplus
在id空间中生成随机数。您有“很少的空白”,因此将10%(足以轻松覆盖空白)添加到要检索的行数。 每个id都可以随机选择多次(尽管对于大id空格来说不太可能),因此将生成的数字分组(或使用DISTINCT)。 将id连接到大表中。这应该是非常快的索引到位。 最后,修剪多余的id,没有被dupes和gap吃掉。每一行都有完全相等的机会被选中。
短的版本
您可以简化这个查询。上面查询中的CTE仅用于教育目的:
SELECT *
FROM (
SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
FROM generate_series(1, 1100) g
) r
JOIN big USING (id)
LIMIT 1000;
使用rCTE进行细化
特别是当你对差距和估计不太确定的时候。
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM generate_series(1, 1030) -- 1000 + few percent - adapt to your needs
LIMIT 1030 -- hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
UNION -- eliminate dupe
SELECT b.*
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM random_pick r -- plus 3 percent - adapt to your needs
LIMIT 999 -- less than 1000, hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
)
TABLE random_pick
LIMIT 1000; -- actual limit
我们可以在基本查询中使用较小的剩余。如果有太多的间隙,所以我们在第一次迭代中没有找到足够的行,rCTE将继续使用递归项进行迭代。我们仍然需要相对较少的ID空间间隙,否则递归可能会在达到极限之前耗尽——或者我们必须从一个足够大的缓冲区开始,这与优化性能的目的相矛盾。
在rCTE中,由UNION消除重复。
外部LIMIT使CTE停止,只要我们有足够的行。
这个查询是精心起草的,使用可用的索引,生成实际上是随机的行,直到达到限制才停止(除非递归耗尽)。如果你要重写它,这里会有很多陷阱。
包装成函数
重复使用相同的表,不同的参数:
CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
RETURNS SETOF big
LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
_surplus int := _limit * _gaps;
_estimate int := ( -- get current estimate from system
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'big'::regclass);
BEGIN
RETURN QUERY
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM generate_series(1, _surplus) g
LIMIT _surplus -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
UNION -- eliminate dupes
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM random_pick -- just to make it recursive
LIMIT _limit -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
)
TABLE random_pick
LIMIT _limit;
END
$func$;
电话:
SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);
泛型函数
我们可以让这个泛型适用于任何具有唯一整数列的表(通常是PK):将表作为多态类型传递,并(可选地)传递PK列的名称,并使用EXECUTE:
CREATE OR REPLACE FUNCTION f_random_sample(_tbl_type anyelement
, _id text = 'id'
, _limit int = 1000
, _gaps real = 1.03)
RETURNS SETOF anyelement
LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
-- safe syntax with schema & quotes where needed
_tbl text := pg_typeof(_tbl_type)::text;
_estimate int := (SELECT (reltuples / relpages
* (pg_relation_size(oid) / 8192))::bigint
FROM pg_class -- get current estimate from system
WHERE oid = _tbl::regclass);
BEGIN
RETURN QUERY EXECUTE format(
$$
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * $1)::int
FROM generate_series(1, $2) g
LIMIT $2 -- hint for query planner
) r(%2$I)
JOIN %1$s USING (%2$I) -- eliminate misses
UNION -- eliminate dupes
SELECT *
FROM (
SELECT 1 + trunc(random() * $1)::int
FROM random_pick -- just to make it recursive
LIMIT $3 -- hint for query planner
) r(%2$I)
JOIN %1$s USING (%2$I) -- eliminate misses
)
TABLE random_pick
LIMIT $3;
$$
, _tbl, _id
)
USING _estimate -- $1
, (_limit * _gaps)::int -- $2 ("surplus")
, _limit -- $3
;
END
$func$;
调用默认值(重要!):
SELECT * FROM f_random_sample(null::big); --!
或者更具体地说:
SELECT * FROM f_random_sample(null::"my_TABLE", 'oDD ID', 666, 1.15);
性能与静态版本基本相同。
相关:
重构一个PL/pgSQL函数以返回各种SELECT查询的输出-章节“各种完整的表类型” 从PostgreSQL函数返回SETOF行 Format()用于EXECUTE? 在触发器函数中插入动态表名
这对于SQL注入是安全的。看到的:
表名作为PostgreSQL函数参数 Postgres函数中的SQL注入vs准备好的查询
可能的替代方法
如果你的需求允许重复调用的相同集(我们说的是重复调用)考虑一个物化视图。执行上述查询一次,并将结果写入一个表。用户以闪电般的速度获得准随机选择。每隔一段时间或你选择的事件刷新你的随机选择。
Postgres 9.5引入了TABLESAMPLE系统(n)
其中n是百分比。手册:
伯努利和系统抽样方法各接受一个单一 参数,它是要采样的表的分数,表示为a 0到100之间的百分比。这个参数可以是任何实值表达式。
大胆强调我的。它非常快,但结果不是完全随机的。再看一下手册:
SYSTEM方法明显比BERNOULLI方法快 当指定小的抽样百分比时,但它可能返回一个 由于聚类效应,表中样本的随机程度较低。
返回的行数变化很大。在我们的例子中,要获取大约1000行:
SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
相关:
在PostgreSQL中快速发现表的行数
或者安装额外的模块tsm_system_rows,以准确获取所请求的行数(如果有足够的行),并允许更方便的语法:
SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
详情见埃文的回答。
但这仍然不是完全随机的。
您可以通过使用来检查和比较两者的执行计划
EXPLAIN select * from table where random() < 0.01;
EXPLAIN select * from table order by random() limit 1000;
对一个大型表1的快速测试表明,ORDER BY首先对整个表进行排序,然后选择前1000个项。对一个大表进行排序不仅要读取该表,还包括读取和写入临时文件。where random() < 0.1只扫描整个表一次。
对于大型表,这可能不是您想要的,因为即使是一次完整的表扫描也可能需要很长时间。
第三个建议是
select * from table where random() < 0.01 limit 1000;
这个方法在找到1000行后立即停止表扫描,因此返回得更快。当然,这将降低随机性,但也许这对于你来说已经足够好了。
编辑:除了这些考虑因素之外,你可以看看已经问过的问题。使用查询[postgresql]随机返回一些结果。
快速随机行选择在Postgres 如何从postgreSQL表检索随机数据行? Postgres:从表中获取随机条目-太慢
depez的一篇链接文章概述了更多的方法:
http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
1“大”是指“完整的表将不适合内存”。
Postgresql order by random(),按随机顺序选择行:
这是缓慢的,因为它对整个表进行排序,以保证每一行都有完全相等的机会被选中。全表扫描对于完美的随机性是不可避免的。
select your_columns from your_table ORDER BY random()
Postgresql order by random() with distinct:
select * from
(select distinct your_columns from your_table) table_alias
ORDER BY random()
Postgresql顺序随机限制一行:
这也很慢,因为它必须扫描表,以确保每一行都有相同的机会被选中,就在这一刻:
select your_columns from your_table ORDER BY random() limit 1
常数时间选择随机N行元素周期表扫描:
如果您的表非常大,那么上面的表扫描就需要花费5分钟才能完成。
为了更快,你可以安排一个幕后的夜间表扫描驯鹿,这将保证一个O(1)恒定时间速度的完美随机选择,除了在夜间索引表扫描期间,在你可能收到另一个随机行之前,它必须等待维护完成。
--Create a demo table with lots of random nonuniform data, big_data
--is your huge table you want to get random rows from in constant time.
drop table if exists big_data;
CREATE TABLE big_data (id serial unique, some_data text );
CREATE INDEX ON big_data (id);
--Fill it with a million rows which simulates your beautiful data:
INSERT INTO big_data (some_data) SELECT md5(random()::text) AS some_data
FROM generate_series(1,10000000);
--This delete statement puts holes in your index
--making it NONuniformly distributed
DELETE FROM big_data WHERE id IN (2, 4, 6, 7, 8);
--Do the nightly maintenance task on a schedule at 1AM.
drop table if exists big_data_mapper;
CREATE TABLE big_data_mapper (id serial, big_data_id int);
CREATE INDEX ON big_data_mapper (id);
CREATE INDEX ON big_data_mapper (big_data_id);
INSERT INTO big_data_mapper(big_data_id) SELECT id FROM big_data ORDER BY id;
--We have to use a function because the big_data_mapper might be out-of-date
--in between nightly tasks, so to solve the problem of a missing row,
--you try again until you succeed. In the event the big_data_mapper
--is broken, it tries 25 times then gives up and returns -1.
CREATE or replace FUNCTION get_random_big_data_id()
RETURNS int language plpgsql AS $$
declare
response int;
BEGIN
--Loop is required because big_data_mapper could be old
--Keep rolling the dice until you find one that hits.
for counter in 1..25 loop
SELECT big_data_id
FROM big_data_mapper OFFSET floor(random() * (
select max(id) biggest_value from big_data_mapper
)
) LIMIT 1 into response;
if response is not null then
return response;
end if;
end loop;
return -1;
END;
$$;
--get a random big_data id in constant time:
select get_random_big_data_id();
--Get 1 random row from big_data table in constant time:
select * from big_data where id in (
select get_random_big_data_id() from big_data limit 1
);
┌─────────┬──────────────────────────────────┐
│ id │ some_data │
├─────────┼──────────────────────────────────┤
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │
└─────────┴──────────────────────────────────┘
--Get 4 random rows from big_data in constant time:
select * from big_data where id in (
select get_random_big_data_id() from big_data limit 3
);
┌─────────┬──────────────────────────────────┐
│ id │ some_data │
├─────────┼──────────────────────────────────┤
│ 2722848 │ fab6a7d76d9637af89b155f2e614fc96 │
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │
│ 9475611 │ 36ac3eeb6b3e171cacd475e7f9dade56 │
└─────────┴──────────────────────────────────┘
--Test what happens when big_data_mapper stops receiving
--nightly reindexing.
delete from big_data_mapper where 1=1;
select get_random_big_data_id(); --It tries 25 times, and returns -1
--which means wait N minutes and try again.
改编自:https://www.gab.lc/articles/bigdata_postgresql_order_by_random
或者,如果以上都是太多的工作。
A simpler good 'nuff solution for constant time select random row is to make a new column on your big table called big_data.mapper_int make it not null with a unique index. Every night reset the column with a unique integer between 1 and max(n). To get a random row you "choose a random integer between 0 and max(id)" and return the row where mapper_int is that. If there's no row by that id, because the row has changed since re-index, choose another random row. If a row is added to big_data.mapper_int then populate it with max(id) + 1
Erwin Brandstetter所概述的物化观点“可能的替代方案”的变体是可能的。
例如,您不希望在返回的随机化值中出现重复值。一个示例用例是生成只能使用一次的短代码。
包含你的(非随机的)值集的主表必须有一些表达式来决定哪些行是“被使用的”,哪些行不是——在这里我将保持简单,只创建一个布尔列,并使用名称。
假设这是输入表(可能会添加其他列,因为它们不会影响解决方案):
id_values id | used
----+--------
1 | FALSE
2 | FALSE
3 | FALSE
4 | FALSE
5 | FALSE
...
根据需要填充ID_VALUES表。然后,正如Erwin所描述的,创建一个物化视图,将ID_VALUES表随机化一次:
CREATE MATERIALIZED VIEW id_values_randomized AS
SELECT id
FROM id_values
ORDER BY random();
注意,物化视图不包含已使用的列,因为这很快就会过时。视图也不需要包含id_values表中的其他列。
为了获得(并“使用”)随机值,在id_values上使用update - return,通过连接从id_values_randomised中选择id_values,并应用所需的条件来只获得相关的可能性。例如:
UPDATE id_values
SET used = TRUE
WHERE id_values.id IN
(SELECT i.id
FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
WHERE (NOT i.used)
LIMIT 1)
RETURNING id;
根据需要更改LIMIT——如果一次需要多个随机值,请将LIMIT更改为n,其中n是所需值的数量。
With the proper indexes on id_values, I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new id's added to the id_values table need to be made available.