最好的方法选择随机行PostgreSQL

我想在PostgreSQL中随机选择行，我尝试了这个:

select * from table where random() < 0.01;

但也有人建议:

select * from table order by random() limit 1000;

我有一个非常大的表，有5亿行，我希望它是快速的。

哪种方法更好?有什么不同?选择随机行最好的方法是什么?

当前回答

Postgresql order by random()，按随机顺序选择行:

这是缓慢的，因为它对整个表进行排序，以保证每一行都有完全相等的机会被选中。全表扫描对于完美的随机性是不可避免的。

select your_columns from your_table ORDER BY random()

Postgresql order by random() with distinct:

select * from 
  (select distinct your_columns from your_table) table_alias
ORDER BY random()

Postgresql顺序随机限制一行:

这也很慢，因为它必须扫描表，以确保每一行都有相同的机会被选中，就在这一刻:

select your_columns from your_table ORDER BY random() limit 1

常数时间选择随机N行元素周期表扫描:

如果您的表非常大，那么上面的表扫描就需要花费5分钟才能完成。

为了更快，你可以安排一个幕后的夜间表扫描驯鹿，这将保证一个O(1)恒定时间速度的完美随机选择，除了在夜间索引表扫描期间，在你可能收到另一个随机行之前，它必须等待维护完成。

--Create a demo table with lots of random nonuniform data, big_data 
--is your huge table you want to get random rows from in constant time. 
drop table if exists big_data;  
CREATE TABLE big_data (id serial unique, some_data text );  
CREATE INDEX ON big_data (id);  
--Fill it with a million rows which simulates your beautiful data:  
INSERT INTO big_data (some_data) SELECT md5(random()::text) AS some_data
FROM generate_series(1,10000000);
 
--This delete statement puts holes in your index
--making it NONuniformly distributed  
DELETE FROM big_data WHERE id IN (2, 4, 6, 7, 8); 
 
 
--Do the nightly maintenance task on a schedule at 1AM.
drop table if exists big_data_mapper; 
CREATE TABLE big_data_mapper (id serial, big_data_id int); 
CREATE INDEX ON big_data_mapper (id); 
CREATE INDEX ON big_data_mapper (big_data_id); 
INSERT INTO big_data_mapper(big_data_id) SELECT id FROM big_data ORDER BY id;
 
--We have to use a function because the big_data_mapper might be out-of-date
--in between nightly tasks, so to solve the problem of a missing row, 
--you try again until you succeed.  In the event the big_data_mapper 
--is broken, it tries 25 times then gives up and returns -1. 
CREATE or replace FUNCTION get_random_big_data_id()  
RETURNS int language plpgsql AS $$ 
declare  
    response int; 
BEGIN
    --Loop is required because big_data_mapper could be old
    --Keep rolling the dice until you find one that hits.
    for counter in 1..25 loop
        SELECT big_data_id 
        FROM big_data_mapper OFFSET floor(random() * ( 
            select max(id) biggest_value from big_data_mapper 
            )
        ) LIMIT 1 into response;
        if response is not null then
            return response;
        end if;
    end loop;
    return -1;
END;  
$$; 
 
--get a random big_data id in constant time: 
select get_random_big_data_id(); 
 
--Get 1 random row from big_data table in constant time: 
select * from big_data where id in ( 
    select get_random_big_data_id() from big_data limit 1 
); 
┌─────────┬──────────────────────────────────┐ 
│   id    │            some_data             │ 
├─────────┼──────────────────────────────────┤ 
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │ 
└─────────┴──────────────────────────────────┘ 

--Get 4 random rows from big_data in constant time: 
select * from big_data where id in ( 
    select get_random_big_data_id() from big_data limit 3 
);
┌─────────┬──────────────────────────────────┐ 
│   id    │            some_data             │ 
├─────────┼──────────────────────────────────┤ 
│ 2722848 │ fab6a7d76d9637af89b155f2e614fc96 │ 
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │ 
│ 9475611 │ 36ac3eeb6b3e171cacd475e7f9dade56 │ 
└─────────┴──────────────────────────────────┘ 

--Test what happens when big_data_mapper stops receiving 
--nightly reindexing.
delete from big_data_mapper where 1=1; 
select get_random_big_data_id();   --It tries 25 times, and returns -1
                                   --which means wait N minutes and try again.

改编自:https://www.gab.lc/articles/bigdata_postgresql_order_by_random

或者，如果以上都是太多的工作。

A simpler good 'nuff solution for constant time select random row is to make a new column on your big table called big_data.mapper_int make it not null with a unique index. Every night reset the column with a unique integer between 1 and max(n). To get a random row you "choose a random integer between 0 and max(id)" and return the row where mapper_int is that. If there's no row by that id, because the row has changed since re-index, choose another random row. If a row is added to big_data.mapper_int then populate it with max(id) + 1

2013-01-22 01:49:04

其他回答

select * from table order by random() limit 1000;

如果知道需要多少行，请检查tsm_system_rows。

tsm_system_rows

module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command. This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.

首先安装扩展

CREATE EXTENSION tsm_system_rows;

然后你的问题，

SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);

2016-12-27 01:03:24

从PostgreSQL 9.5开始，有一个新的语法专门用于从表中获取随机元素:

SELECT * FROM mytable TABLESAMPLE SYSTEM (5);

这个例子将给出mytable中5%的元素。

有关文档的更多说明:http://www.postgresql.org/docs/current/static/sql-select.html

2016-01-25 10:24:28

我知道我有点晚了，但我刚刚找到了这个叫做pg_sample的很棒的工具:

pg_sample -从较大的PostgreSQL数据库中提取一个小的样本数据集，同时保持引用完整性。

我尝试了一个350M行的数据库，它真的很快，不知道随机性。

./pg_sample --limit="small_table = *" --limit="large_table = 100000" -U postgres source_db | psql -U postgres target_db

2018-03-08 12:43:04

快速的方式

根据您的规格(以及评论中的其他信息)，

您有一个数字ID列(整数)，只有很少(或中等数量)的间隙。显然没有或很少有写操作。您的ID列必须被索引!主键很有用。

下面的查询不需要对大表进行顺序扫描，只需要进行索引扫描。

首先，获取主查询的估计值:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
     , max(id)  AS max_id
     , max(id) - min(id) AS id_span
FROM   big;

唯一可能昂贵的部分是count(*)(对于巨大的表)。考虑到上述规格，您不需要它。用一个估计来代替完整的计数就可以了，几乎不需要任何成本:

SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS ct
FROM   pg_class
WHERE  oid = 'big'::regclass;  -- your table name

详细解释:

在PostgreSQL中快速发现表的行数

只要ct不比id_span小很多，查询的性能就优于其他方法。

WITH params AS (
   SELECT 1       AS min_id           -- minimum id <= current min id
        , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
    )
SELECT *
FROM  (
   SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
   FROM   params p
        , generate_series(1, 1100) g  -- 1000 + buffer
   GROUP  BY 1                        -- trim duplicates
) r
JOIN   big USING (id)
LIMIT  1000;                          -- trim surplus

在id空间中生成随机数。您有“很少的空白”，因此将10%(足以轻松覆盖空白)添加到要检索的行数。每个id都可以随机选择多次(尽管对于大id空格来说不太可能)，因此将生成的数字分组(或使用DISTINCT)。将id连接到大表中。这应该是非常快的索引到位。最后，修剪多余的id，没有被dupes和gap吃掉。每一行都有完全相等的机会被选中。

短的版本

您可以简化这个查询。上面查询中的CTE仅用于教育目的:

SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;

使用rCTE进行细化

特别是当你对差距和估计不太确定的时候。

WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
TABLE  random_pick
LIMIT  1000;  -- actual limit

我们可以在基本查询中使用较小的剩余。如果有太多的间隙，所以我们在第一次迭代中没有找到足够的行，rCTE将继续使用递归项进行迭代。我们仍然需要相对较少的ID空间间隙，否则递归可能会在达到极限之前耗尽——或者我们必须从一个足够大的缓冲区开始，这与优化性能的目的相矛盾。

在rCTE中，由UNION消除重复。

外部LIMIT使CTE停止，只要我们有足够的行。

这个查询是精心起草的，使用可用的索引，生成实际上是随机的行，直到达到限制才停止(除非递归耗尽)。如果你要重写它，这里会有很多陷阱。

包装成函数

重复使用相同的表，不同的参数:

CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (           -- get current estimate from system
      SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
      FROM   pg_class
      WHERE  oid = 'big'::regclass);
BEGIN
   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus           -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  _limit             -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses
   )
   TABLE  random_pick
   LIMIT  _limit;
END
$func$;

电话:

SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);

泛型函数

我们可以让这个泛型适用于任何具有唯一整数列的表(通常是PK):将表作为多态类型传递，并(可选地)传递PK列的名称，并使用EXECUTE:

CREATE OR REPLACE FUNCTION f_random_sample(_tbl_type anyelement
                                         , _id text = 'id'
                                         , _limit int = 1000
                                         , _gaps real = 1.03)
  RETURNS SETOF anyelement
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   -- safe syntax with schema & quotes where needed
   _tbl text := pg_typeof(_tbl_type)::text;
   _estimate int := (SELECT (reltuples / relpages
                          * (pg_relation_size(oid) / 8192))::bigint
                     FROM   pg_class  -- get current estimate from system
                     WHERE  oid = _tbl::regclass);
BEGIN
   RETURN QUERY EXECUTE format(
   $$
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   generate_series(1, $2) g
         LIMIT  $2                 -- hint for query planner
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)     -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  $3                 -- hint for query planner
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)     -- eliminate misses
   )
   TABLE  random_pick
   LIMIT  $3;
   $$
 , _tbl, _id
   )
   USING _estimate              -- $1
       , (_limit * _gaps)::int  -- $2 ("surplus")
       , _limit                 -- $3
   ;
END
$func$;

调用默认值(重要!):

SELECT * FROM f_random_sample(null::big);  --!

或者更具体地说:

SELECT * FROM f_random_sample(null::"my_TABLE", 'oDD ID', 666, 1.15);

性能与静态版本基本相同。

最好的方法选择随机行PostgreSQL

推荐文章

最新文章

标签