Best way to select random rows in PostgreSQL

select * from table order by random() limit 1000;

If you know how many rows you want, check out tsm_system_rows.

The tsm_system_rows module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command. This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.

First install the extension:

CREATE EXTENSION tsm_system_rows;

Then your query:

SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);

2016-12-27 01:03:24

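The one with the ORDER BY is going to be the slower one.

select * from table where random() < 0.01; goes record by record and decides whether to randomly filter it out or not. This is O(N) because it only needs to check each record once.

select * from table order by random() limit 1000; sorts the entire table, then picks the first 1000. Aside from any voodoo magic behind the scenes, the ORDER BY is O(N * log N).

The downside of the random() < 0.01 approach is that the number of output records is variable.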
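
To make the variable-count point concrete, here is a small Python simulation. It is not SQL and says nothing about PostgreSQL internals. It simply models the random() < 0.01 filter as an independent coin flip per row. The table size of 100,000 rows and the function name bernoulli_filter are assumptions chosen for illustration.

```python
import random

def bernoulli_filter(n_rows, p, rng):
    """Model of WHERE random() < p: one pass, each row kept independently with probability p."""
    return [row for row in range(n_rows) if rng.random() < p]

# Hypothetical table of 100,000 rows, sampled three times with p = 0.01.
rng = random.Random(42)
counts = [len(bernoulli_filter(100_000, 0.01, rng)) for _ in range(3)]
print(counts)  # three counts close to 1000, generally not identical
```

Each run does a single O(N) pass, but the row count only averages N * p; it is not fixed.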

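Note that there is a better way to shuffle a set of data than sorting by random: the Fisher-Yates shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite a challenge, though.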
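
For reference, the Fisher-Yates shuffle itself is short when written outside SQL. The following is a minimal Python sketch, assuming the row identifiers already sit in an in-memory list; the name fisher_yates_shuffle and the 10,000-element list are illustrative only and do not come from the answer above.

```python
import random

def fisher_yates_shuffle(items, rng=random):
    """Shuffle a list in place in O(N) with the Fisher-Yates algorithm."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        items[i], items[j] = items[j], items[i]
    return items

rows = list(range(10_000))  # stand-in for row identifiers
shuffled = fisher_yates_shuffle(rows[:], random.Random(0))
sample = shuffled[:1000]  # the first 1000 items of a uniform shuffle
```

Taking a prefix of the shuffled list gives a fixed-size sample without any sort step.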

2011-12-29 23:46:46

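My experience tells me:

offset floor(random() * N) limit 1 is not faster than order by random() limit 1.

I thought the offset approach would be faster because it should save the time Postgres spends sorting. It turns out that was not the case.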

2019-06-19 18:59:26

You can examine and compare the execution plans of both by using

EXPLAIN select * from table where random() < 0.01;
EXPLAIN select * from table order by random() limit 1000;

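A quick test on a large table [1] shows that the ORDER BY first sorts the complete table and then picks the first 1000 items. Sorting a large table not only reads that table but also involves reading and writing temporary files. The where random() < 0.01 only scans the complete table once.

For large tables this might not be what you want, as even one complete table scan might take too long.

A third proposal would be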

select * from table where random() < 0.01 limit 1000;

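This one stops the table scan as soon as 1000 rows have been found and therefore returns sooner. Of course this reduces the randomness somewhat, but perhaps it is good enough in your case.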
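
The loss of randomness can be illustrated with a small Python model of the early-stopping scan. This is a sketch under stated assumptions, not a description of how PostgreSQL executes the query: it assumes rows are visited in a fixed order 0..n-1, and the 1,000,000-row table and the name filter_with_limit are hypothetical.

```python
import random

def filter_with_limit(n_rows, p, limit, rng):
    """Model of WHERE random() < p LIMIT limit: scan in order, stop once `limit` rows are kept."""
    kept = []
    for row in range(n_rows):
        if rng.random() < p:
            kept.append(row)
            if len(kept) == limit:
                break  # the scan stops here; later rows are never considered
    return kept

# Hypothetical table of 1,000,000 rows scanned in order 0..n-1.
picked = filter_with_limit(1_000_000, 0.01, 1000, random.Random(1))
print(len(picked), picked[-1])  # the last picked position is typically near 100,000
```

In this model every selected row comes from roughly the first tenth of the table, which is the bias the early stop trades for speed.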

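Edit: Besides these considerations, you might check out the questions already asked about this. Searching for [postgresql] random returns quite a few hits:

quick random row selection in Postgres
How to retrieve randomized data rows from a postgreSQL table?
postgres: get random entries from table - too slow

And a linked article by depesz outlines several more approaches: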

http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/

[1] "large" as in "the complete table will not fit into memory".

2011-12-29 23:51:24

Here is a solution that works for me. I think it is easy to understand and execute.

-- field_1, field_2, big_table and some_conditions are placeholders
SELECT
  field_1,
  field_2,
  random() AS ordering
FROM
  big_table
WHERE
  some_conditions
ORDER BY
  ordering
LIMIT 1000;

2017-06-01 11:34:40
