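How to request a random row in SQL?

How can I request a random row (or as close to truly random as possible) in pure SQL?

Current answer

Late, but I got here via Google, so for the sake of posterity I'll add an alternative solution.

Another approach is to use TOP twice, with alternating sort orders. I don't know if it counts as "pure SQL", because it uses a variable in the TOP, but it works in SQL Server 2008. Here is an example I use against a table of dictionary words when I want a random word.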

SELECT TOP 1
  word
FROM (
  SELECT TOP(@idx)
    word 
  FROM
    dbo.DictionaryAbridged WITH(NOLOCK)
  ORDER BY
    word DESC
) AS D
ORDER BY
  word ASC

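Of course, @idx is a randomly generated integer between 1 and COUNT(*) on the target table, inclusive. If the column is indexed, the query benefits from that as well. Another advantage is that you can use it in a function, where NEWID() is not allowed.

Lastly, on the same table the above query runs in about 1/10 of the execution time of a NEWID()-type query. YMMV.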
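
As a rough illustration of how @idx can be produced and fed into the nested query, here is a minimal sketch using Python's built-in sqlite3 module so that it runs without a server. The table `dictionary` and the helpers `word_at` and `random_word` are hypothetical stand-ins, and SQLite's LIMIT takes the place of TOP; this is not the SQL Server code above.

```python
import random
import sqlite3

# Hypothetical stand-in for dbo.DictionaryAbridged.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dictionary (word TEXT PRIMARY KEY)")
words = ["apple", "banana", "cherry", "damson", "elder"]
conn.executemany("INSERT INTO dictionary VALUES (?)", [(w,) for w in words])


def word_at(conn, idx):
    """Return the idx-th word counted from the end of the alphabetical order."""
    row = conn.execute(
        """
        SELECT word FROM (
            SELECT word FROM dictionary ORDER BY word DESC LIMIT ?
        ) AS d
        ORDER BY word ASC
        LIMIT 1
        """,
        (idx,),
    ).fetchone()
    return row[0] if row else None


def random_word(conn):
    """Pick idx uniformly from 1..COUNT(*) and fetch the word at that position."""
    (count,) = conn.execute("SELECT COUNT(*) FROM dictionary").fetchone()
    if count == 0:
        return None
    idx = random.randint(1, count)  # inclusive on both ends
    return word_at(conn, idx)
```

The inner query keeps the last idx words and the outer query takes the smallest of them, so every position from 1 to COUNT(*) maps to exactly one word.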

2010-07-20 02:03:34

Other answers

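I have to agree with CD-MaN: using "ORDER BY RAND()" works well for small tables or when you only run the SELECT a few times.

I also use the "num_value >= RAND() * ..." technique, and if I really want random results, I keep a special "random" column in the table that I update roughly once a day. That single UPDATE run takes some time (especially because the column has to be indexed), but it is much faster than generating a random number for every row each time the SELECT runs.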
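
A minimal sketch of that split between an occasional refresh job and a cheap per-request lookup, written against Python's built-in sqlite3 so that it is runnable as-is. The table `items`, the column `rand_key` and both function names are assumptions made for illustration; the answer does not show its actual schema.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
# rand_key is the hypothetical "random" column; the index keeps the range lookup cheap.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, rand_key REAL)")
conn.execute("CREATE INDEX idx_items_rand_key ON items (rand_key)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)", [(f"item{i}",) for i in range(100)]
)


def refresh_random_column(conn):
    """The occasional (e.g. daily) job: give every row a fresh value in [0, 1)."""
    ids = [r[0] for r in conn.execute("SELECT id FROM items").fetchall()]
    conn.executemany(
        "UPDATE items SET rand_key = ? WHERE id = ?",
        [(random.random(), i) for i in ids],
    )
    conn.commit()


def pick_one(conn):
    """Per-request lookup: the first row at or above a random threshold."""
    threshold = random.random()
    return conn.execute(
        "SELECT id, name FROM items WHERE rand_key >= ? ORDER BY rand_key LIMIT 1",
        (threshold,),
    ).fetchone()
```

`pick_one` returns None when the threshold lands above the largest stored value, so a caller would need to retry or fall back in that case.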

2008-08-21 07:20:31

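See this post: SQL to Select a random row from a database table. It covers ways of doing this in MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2 and Oracle (the following is copied from that link):

Select a random row with MySQL: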

SELECT column FROM table
ORDER BY RAND()
LIMIT 1

Select a random row with PostgreSQL:

SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1

Select a random row with Microsoft SQL Server:

SELECT TOP 1 column FROM table
ORDER BY NEWID()

Select a random row with IBM DB2:

SELECT column, RAND() as IDX 
FROM table 
ORDER BY IDX FETCH FIRST 1 ROWS ONLY

Select a random record with Oracle:

SELECT column FROM
( SELECT column FROM table
ORDER BY dbms_random.value )
WHERE rownum = 1

2008-08-21 06:32:32

The best way is to put a random value in a new column kept just for that purpose, and use something like this (pseudocode + SQL):

randomNo = random()
execSql("SELECT TOP 1 * FROM MyTable WHERE MyTable.Randomness > $randomNo ORDER BY MyTable.Randomness")

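This is the solution adopted by the MediaWiki code. Of course, there is some bias against smaller values, but they found that it was sufficient to wrap the random value around to zero when no rows are fetched.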
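
The wrap-around step can be sketched as follows. This is only an illustration using Python's built-in sqlite3, not MediaWiki's code; the table `my_table`, the column `randomness`, the three stored values and the helper `pick` are all made up for the example, and the threshold is passed in explicitly so that the wrap is easy to see.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, randomness REAL)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)", [(1, 0.10), (2, 0.40), (3, 0.85)]
)


def pick(conn, random_no):
    """Return the id of the first row above random_no, wrapping to 0 if none exists."""
    query = "SELECT id FROM my_table WHERE randomness > ? ORDER BY randomness LIMIT 1"
    row = conn.execute(query, (random_no,)).fetchone()
    if row is None:
        # No stored value above random_no: wrap the threshold around to 0.
        row = conn.execute(query, (0.0,)).fetchone()
    return row[0] if row else None
```

In this sketch each row is chosen with a probability equal to the gap between its stored value and the previous one (with the wrap closing the circle), so the selection is only as even as the stored random values are evenly spread.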

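The NEWID() solution may require a full table scan so that each row can be assigned a new GUID, which will perform much worse.

The RAND() solution may not work at all (e.g. with MSSQL), because the function is evaluated just once and every row is assigned the same "random" number.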

2008-08-21 06:36:10

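For SQL Server and needing "a single random row"..

If true sampling is not needed, generate a random value in [0, max_rows) and use the ORDER BY..OFFSET..FETCH available from SQL Server 2012+.

This is very fast if the COUNT and ORDER BY are over appropriate indexes, such that the data is "already sorted" along the query lines. If these operations are covered, it is a quick request and does not suffer from the horrid scalability of ORDER BY NEWID() or similar. Obviously, this approach won't scale well on a non-indexed HEAP table.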

declare @rows int
select @rows = count(1) from t

-- Other issues if row counts in the bigint range..
-- This is also not 'true random', although such is likely not required.
declare @skip int = convert(int, @rows * rand())

select t.*
from t
order by t.id -- Make sure this is clustered PK or IX/UCL axis!
offset (@skip) rows
fetch first 1 row only

Make sure that appropriate transaction isolation levels are used and/or account for 0 results.
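
A minimal sketch of the same count-then-skip idea, including the 0-result case, using Python's built-in sqlite3 so that it runs anywhere. SQLite's LIMIT..OFFSET stands in for OFFSET..FETCH, and the `payload` column and the helper names `row_at_offset` and `random_row` are hypothetical.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, f"row{i}") for i in range(1, 11)])


def row_at_offset(conn, skip):
    """Fetch the row that follows `skip` rows in primary-key order, or None."""
    return conn.execute(
        "SELECT id, payload FROM t ORDER BY id LIMIT 1 OFFSET ?", (skip,)
    ).fetchone()


def random_row(conn):
    """Count the rows, skip a random number of them, and fetch the next one."""
    (rows,) = conn.execute("SELECT COUNT(*) FROM t").fetchone()
    if rows == 0:
        return None
    skip = random.randrange(rows)  # integer in [0, rows)
    # Rows may disappear between the COUNT and this SELECT, so None is still possible.
    return row_at_offset(conn, skip)
```

A caller has to treat None as a valid outcome, either because the table is empty or because the row count shrank between the two statements.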

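For SQL Server and needing a "general row sample" approach..

Note: This is an adaptation of an answer found on a SQL Server specific question about fetching a sample of rows. It has been tailored for context.

While a general sampling approach should be used with caution here, it is still potentially useful information in the context of other answers (and the repeated suggestions of non-scaling and/or questionable implementations). If the goal is to find a "single random row", such a sampling approach is less efficient than the first code shown and is error-prone.

Here is an updated and improved form of sampling a percentage of rows. It is based on the same concept as some other answers that use CHECKSUM / BINARY_CHECKSUM and modulus.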

- It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.
- Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution.
- Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions. Approaches that use NEWID() can never be stable/repeatable.
- Does not use ORDER BY NEWID() of the entire input set, as ordering can become a significant bottleneck with large input sets. Avoiding unnecessary sorting also reduces memory and tempdb usage.
- Does not use TABLESAMPLE and thus works with a WHERE pre-filter.

Here is the gist. See this answer for additional details and notes.

Naïve try:

declare @sample_percent decimal(7, 4)
-- Looking at this value should be an indicator of why a
-- general sampling approach can be error-prone to select 1 row.
select @sample_percent = 100.0 / count(1) from t

-- BAD!
-- When choosing appropriate sample percent of "approximately 1 row"
-- it is very reasonable to expect 0 rows, which definitely fails the ask!
-- If choosing a larger sample size the distribution is heavily skewed forward,
-- and is very much NOT 'true random'.
select top 1
    t.*
from t
where 1=1
    and ( -- sample
        @sample_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )

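This can be largely remedied by a hybrid query that mixes sampling with an ORDER BY selection from the much smaller sample set. This limits the sorting operation to the size of the sample, not the size of the original table.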

-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare @rows int
select @rows = count(1) from t

declare @sample_size int = 1000
declare @sample_percent decimal(7, 4) = case
    when @rows <= 1000 then 100                              -- not enough rows
    when (100.0 * @sample_size / @rows) < 0.0001 then 0.0001 -- min sample percent
    else 100.0 * @sample_size / @rows                        -- everything else
    end

-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
    t.*
from t
where 1=1
    and ( -- sample
        @sample_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
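
To make the "stable sample first, random order second" idea concrete, here is a rough sketch using Python's built-in sqlite3. SQLite has no hashbytes, so a hypothetical helper `sha1_bucket` is registered as a SQL function; it uses the last 8 bytes of the SHA-1 as an unsigned integer, which follows the same idea as the T-SQL expression above but is not bit-for-bit identical to it. The `payload` column, the sequential GUIDs and the function names are assumptions for the example.

```python
import hashlib
import sqlite3
import uuid


def sha1_bucket(value):
    """Map a value to a stable bucket in [0, 100000) via the last 8 bytes of its SHA-1."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % 100_000


conn = sqlite3.connect(":memory:")
conn.create_function("sha1_bucket", 1, sha1_bucket)
conn.execute("CREATE TABLE t (rowguid TEXT PRIMARY KEY, payload INTEGER)")
# Sequential GUIDs on purpose: the hash still scatters them across buckets.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(str(uuid.UUID(int=i)), i) for i in range(5000)]
)


def sample_guids(conn, sample_percent):
    """Stable sample: the same rows qualify on every call for a given percent."""
    threshold = int(1000 * sample_percent)
    rows = conn.execute(
        "SELECT rowguid FROM t WHERE sha1_bucket(rowguid) < ?", (threshold,)
    ).fetchall()
    return {r[0] for r in rows}


def random_row_from_sample(conn, sample_percent):
    """Sort only the sampled rows randomly and take the first one."""
    threshold = int(1000 * sample_percent)
    return conn.execute(
        "SELECT rowguid, payload FROM t WHERE sha1_bucket(rowguid) < ? "
        "ORDER BY random() LIMIT 1",
        (threshold,),
    ).fetchone()
```

Only the rows that pass the hash filter reach the random sort, which is the point of the hybrid query: the sort cost follows the sample size rather than the table size.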

2021-02-12 22:10:30
