I want to select random rows from a table in PostgreSQL. I tried this:

select * from table where random() < 0.01;

But others recommend this instead:

select * from table order by random() limit 1000;

I have a very large table (500 million rows) and I need it to be fast.

Which approach is better? What are the differences? What is the best way to select random rows?


Current answer

You can examine and compare the execution plans of both by using:

EXPLAIN select * from table where random() < 0.01;
EXPLAIN select * from table order by random() limit 1000;

A quick test on a large table¹ shows that ORDER BY first sorts the complete table and then picks the first 1000 items. Sorting a large table not only reads that table but also involves reading and writing temporary files. where random() < 0.01 only scans the complete table once.

For large tables this might not be what you want, as even one complete table scan can take a long time.
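If you want to go beyond the plan shape and actually see the temporary-file traffic the sort causes, EXPLAIN also accepts the ANALYZE and BUFFERS options. Note that this actually runs the queries, so on a 500M-row table it will take a while; a minimal sketch:

-- Runs both queries and reports buffer reads plus temp-file usage,
-- which makes the extra sort cost of ORDER BY random() visible.
EXPLAIN (ANALYZE, BUFFERS) select * from table where random() < 0.01;
EXPLAIN (ANALYZE, BUFFERS) select * from table order by random() limit 1000;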

A third proposal would be:

select * from table where random() < 0.01 limit 1000;

This one stops the table scan as soon as 1000 rows have been found and therefore returns sooner. Of course this reduces the randomness somewhat, but perhaps that is good enough for your case.

Edit: Besides these considerations, you may want to check out the questions already asked on this topic. Searching for [postgresql] random returns quite a few hits:

quick random row selection in Postgres
How do I retrieve random data rows from a postgreSQL table?
Postgres: Get random entries from table - too slow

A linked article by depesz outlines several more approaches:

http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/


¹ "large" as in "the complete table will not fit into memory".

Other answers

I know I'm a little late to the party, but I just found this awesome tool called pg_sample:

pg_sample - extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity.

I tried it with a 350M-row database and it was really fast; I can't speak to how random the sample is, though.

./pg_sample --limit="small_table = *" --limit="large_table = 100000" -U postgres source_db | psql -U postgres target_db

select * from table order by random() limit 1000;

If you know how many rows you need, take a look at tsm_system_rows.

The tsm_system_rows module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command. This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.

First install the extension:

CREATE EXTENSION tsm_system_rows;

Then your query:

SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);
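Because SYSTEM_ROWS samples whole blocks, rows that live on the same pages tend to come back together. If that clustering is a problem for your data, one possible mitigation (a sketch of my own, not part of this answer; the 10000 figure is just an illustrative oversampling factor) is to sample more rows than you need and shuffle only that small result set:

-- Sketch: block-level oversample, then re-shuffle just the sampled rows.
SELECT *
FROM table TABLESAMPLE SYSTEM_ROWS(10000)
ORDER BY random()
LIMIT 1000;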

The one with ORDER BY is going to be the slower one.

select * from table where random() < 0.01; goes record by record and decides whether to filter it at random. This is O(N) because it only needs to check each record once.

select * from table order by random() limit 1000; sorts the entire table and then picks the first 1000. Aside from any voodoo magic behind the scenes, the order by is O(N * log N).

The downside of random() < 0.01 is that you get a variable number of output records.
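If the variable count is the main concern, one way to at least center it on a target (a sketch I am adding here, not part of the original answer; your_table and the target of 1000 rows are placeholders) is to derive the threshold from the planner's row estimate instead of hard-coding 0.01:

-- Sketch: aim for roughly 1000 rows by dividing the target by the
-- estimated row count.  reltuples is only the planner's estimate and
-- can be stale, so the actual count will still vary around the target.
SELECT t.*
FROM your_table t
WHERE random() < (
    SELECT 1000.0 / greatest(reltuples, 1)
    FROM pg_class
    WHERE oid = 'your_table'::regclass
);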


Note that there is a better way to shuffle a set of data than sorting by random: the Fisher-Yates Shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite a challenge, though.

Postgresql order by random(), select rows in random order:

This is slow because it orders the whole table to guarantee that every row gets an exactly equal chance of being chosen. A full table scan is unavoidable for perfect randomness.

select your_columns from your_table ORDER BY random()

Postgresql order by random() with distinct:

select * from 
  (select distinct your_columns from your_table) table_alias
ORDER BY random()

Postgresql order by random() limit one row:

This is also slow because it has to do a table scan to make sure every row gets an equal chance of being selected, right at this moment:

select your_columns from your_table ORDER BY random() limit 1

Constant-time select of N random rows with a periodic table scan:

If your table is huge, the table scans above can take as long as 5 minutes to finish.

To go faster, you can schedule a behind-the-scenes nightly table-scan reindex, which guarantees a perfectly random selection at O(1) constant-time speed, except during the nightly reindexing table scan, where it must wait for the maintenance to finish before you can get another random row.

--Create a demo table with lots of random nonuniform data, big_data 
--is your huge table you want to get random rows from in constant time. 
drop table if exists big_data;  
CREATE TABLE big_data (id serial unique, some_data text );  
CREATE INDEX ON big_data (id);  
--Fill it with ten million rows which simulate your beautiful data:
INSERT INTO big_data (some_data) SELECT md5(random()::text) AS some_data
FROM generate_series(1,10000000);
 
--This delete statement puts holes in your index
--making it NONuniformly distributed  
DELETE FROM big_data WHERE id IN (2, 4, 6, 7, 8); 
 
 
--Do the nightly maintenance task on a schedule at 1AM.
drop table if exists big_data_mapper; 
CREATE TABLE big_data_mapper (id serial, big_data_id int); 
CREATE INDEX ON big_data_mapper (id); 
CREATE INDEX ON big_data_mapper (big_data_id); 
INSERT INTO big_data_mapper(big_data_id) SELECT id FROM big_data ORDER BY id;
 
--We have to use a function because the big_data_mapper might be out-of-date
--in between nightly tasks, so to solve the problem of a missing row, 
--you try again until you succeed.  In the event the big_data_mapper 
--is broken, it tries 25 times then gives up and returns -1. 
CREATE or replace FUNCTION get_random_big_data_id()  
RETURNS int language plpgsql AS $$ 
declare  
    response int; 
BEGIN
    --Loop is required because big_data_mapper could be old
    --Keep rolling the dice until you find one that hits.
    for counter in 1..25 loop
        SELECT big_data_id 
        FROM big_data_mapper OFFSET floor(random() * ( 
            select max(id) biggest_value from big_data_mapper 
            )
        ) LIMIT 1 into response;
        if response is not null then
            return response;
        end if;
    end loop;
    return -1;
END;  
$$; 
 
--get a random big_data id in constant time: 
select get_random_big_data_id(); 
 
--Get 1 random row from big_data table in constant time: 
select * from big_data where id in ( 
    select get_random_big_data_id() from big_data limit 1 
); 
┌─────────┬──────────────────────────────────┐ 
│   id    │            some_data             │ 
├─────────┼──────────────────────────────────┤ 
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │ 
└─────────┴──────────────────────────────────┘ 

--Get 3 random rows from big_data in constant time: 
select * from big_data where id in ( 
    select get_random_big_data_id() from big_data limit 3 
);
┌─────────┬──────────────────────────────────┐ 
│   id    │            some_data             │ 
├─────────┼──────────────────────────────────┤ 
│ 2722848 │ fab6a7d76d9637af89b155f2e614fc96 │ 
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │ 
│ 9475611 │ 36ac3eeb6b3e171cacd475e7f9dade56 │ 
└─────────┴──────────────────────────────────┘ 

--Test what happens when big_data_mapper stops receiving 
--nightly reindexing.
delete from big_data_mapper where 1=1; 
select get_random_big_data_id();   --It tries 25 times, and returns -1
                                   --which means wait N minutes and try again.

Adapted from: https://www.gab.lc/articles/bigdata_postgresql_order_by_random

Or, if all of the above is too much work:

A simpler good-'nuff solution for constant-time random-row selection is to add a new column to your big table, big_data.mapper_int, make it NOT NULL with a unique index. Every night, reset that column to a unique integer between 1 and max(n). To get a random row, you choose a random integer between 0 and max(id) and return the row whose mapper_int matches it. If there is no row with that id, because the row has changed since the re-index, choose another random integer. If a row is added to big_data, populate its mapper_int with max(id) + 1.
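A minimal sketch of that idea, reusing the big_data table from above (the column name mapper_int and the renumbering statement are my own illustration, not part of the original answer):

-- Sketch: add a dense surrogate key that is renumbered by the nightly job.
ALTER TABLE big_data ADD COLUMN mapper_int int;
CREATE UNIQUE INDEX ON big_data (mapper_int);

-- Nightly maintenance: clear the column first to avoid transient
-- unique-index conflicts, then renumber the rows 1..N in id order.
UPDATE big_data SET mapper_int = NULL;
UPDATE big_data b
SET mapper_int = numbered.rn
FROM (SELECT id, row_number() OVER (ORDER BY id) AS rn FROM big_data) numbered
WHERE b.id = numbered.id;

-- Pick a random integer in [1, max(mapper_int)] and fetch that row.
-- If the row was deleted since the last renumbering, zero rows come
-- back and the caller simply retries with a new random number.
SELECT *
FROM big_data
WHERE mapper_int = (
    SELECT 1 + floor(random() * max(mapper_int))::int
    FROM big_data
);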
