如何在纯SQL中请求随机行(或尽可能接近真正的随机)?
当前回答
对于SQL Server和需要“单个随机行”..
如果不需要真采样,生成一个随机值[0,max_rows)并使用ORDER BY..OFFSET..从SQL Server 2012+获取。
如果COUNT和ORDER BY在适当的索引上,这是非常快的——这样数据就已经沿着查询行“排序”了。如果涵盖了这些操作,那么它就是一个快速请求,并且不会受到使用ORDER BY NEWID()或类似方法的可怕可伸缩性的影响。显然,这种方法在非索引的HEAP表上不能很好地伸缩。
declare @rows int
select @rows = count(1) from t
-- Other issues if row counts in the bigint range..
-- This is also not 'true random', although such is likely not required.
declare @skip int = convert(int, @rows * rand())
select t.*
from t
order by t.id -- Make sure this is clustered PK or IX/UCL axis!
offset (@skip) rows
fetch first 1 row only
确保使用了适当的事务隔离级别和/或考虑0结果。
对于SQL Server,需要一个“一般行样本”的方法..
注意:这是一个在SQL Server上找到的关于获取行样本的特定问题的答案的改编。它是根据上下文量身定制的。
虽然这里应该谨慎使用一般抽样方法,但对于其他答案(以及关于非伸缩和/或有问题的实现的重复建议),它仍然是潜在的有用信息。如果目标是找到“单个随机行”,那么这种抽样方法的效率低于所示的第一个代码,并且容易出错。
这是一个更新和改进的对行百分比进行抽样的形式。它基于与其他一些使用CHECKSUM / BINARY_CHECKSUM和modulus的答案相同的概念。
It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal. Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution. Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions. Approaches that use NEWID() can never be stable/repeatable. Does not use ORDER BY NEWID() of the entire input set, as ordering can become a significant bottleneck with large input sets. Avoiding unnecessary sorting also reduces memory and tempdb usage. Does not use TABLESAMPLE and thus works with a WHERE pre-filter.
这是要点。有关更多细节和注意事项,请参阅这个答案。
Naï亿一下:
declare @sample_percent decimal(7, 4)
-- Looking at this value should be an indicator of why a
-- general sampling approach can be error-prone to select 1 row.
select @sample_percent = 100.0 / count(1) from t
-- BAD!
-- When choosing appropriate sample percent of "approximately 1 row"
-- it is very reasonable to expect 0 rows, which definitely fails the ask!
-- If choosing a larger sample size the distribution is heavily skewed forward,
-- and is very much NOT 'true random'.
select top 1
t.*
from t
where 1=1
and ( -- sample
@sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
这可以在很大程度上通过混合抽样和ORDER by从小得多的样本集中选择的混合查询来补救。这将排序操作限制为样本大小,而不是原始表的大小。
-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare @rows int
select @rows = count(1) from t
declare @sample_size int = 1000
declare @sample_percent decimal(7, 4) = case
when @rows <= 1000 then 100 -- not enough rows
when (100.0 * @sample_size / @rows) < 0.0001 then 0.0001 -- min sample percent
else 100.0 * @sample_size / @rows -- everything else
end
-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
t.*
from t
where 1=1
and ( -- sample
@sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
其他回答
我不知道这有多有效,但我以前用过:
SELECT TOP 1 * FROM MyTable ORDER BY newid()
因为guid是非常随机的,所以顺序意味着您得到的是随机行。
而不是使用RAND(),因为它是不鼓励的,你可以简单地得到max ID (= max):
SELECT MAX(ID) FROM TABLE;
在1..Max (= My_Generated_Random)
My_Generated_Random = rand_in_your_programming_lang_function(1..Max);
然后运行SQL:
SELECT ID FROM TABLE WHERE ID >= My_Generated_Random ORDER BY ID LIMIT 1
注意,它将检查id等于或高于所选值的任何行。 也可以在表中寻找行,并获得一个等于或低于My_Generated_Random的ID,然后修改查询如下:
SELECT ID FROM TABLE WHERE ID <= My_Generated_Random ORDER BY ID DESC LIMIT 1
火鸟:
Select FIRST 1 column from table ORDER BY RAND()
对于SQL Server 2005及以上版本,在num_value没有连续值的情况下扩展@GreyPanther的答案。这也适用于数据集分布不均匀以及num_value不是数字而是唯一标识符的情况。
WITH CTE_Table (SelRow, num_value)
AS
(
SELECT ROW_NUMBER() OVER(ORDER BY ID) AS SelRow, num_value FROM table
)
SELECT * FROM table Where num_value = (
SELECT TOP 1 num_value FROM CTE_Table WHERE SelRow >= RAND() * (SELECT MAX(SelRow) FROM CTE_Table)
)
请参阅这篇文章:从数据库表中随机选择一行的SQL。它介绍了在MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2和Oracle中执行此操作的方法(以下内容是从该链接复制的):
用MySQL随机选择一行:
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
使用PostgreSQL随机选择一行:
SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1
使用Microsoft SQL Server随机选择一行:
SELECT TOP 1 column FROM table
ORDER BY NEWID()
使用IBM DB2选择一个随机行
SELECT column, RAND() as IDX
FROM table
ORDER BY IDX FETCH FIRST 1 ROWS ONLY
使用Oracle随机选择一条记录:
SELECT column FROM
( SELECT column FROM table
ORDER BY dbms_random.value )
WHERE rownum = 1
推荐文章
- 比较两个SQL Server数据库(模式和数据)的最佳工具是什么?
- 在SQL中,如何在范围中“分组”?
- 选项(RECOMPILE)总是更快;为什么?
- 设置数据库从单用户模式到多用户
- oracle中的RANK()和DENSE_RANK()函数有什么区别?
- 的类型不能用作索引中的键列
- SQL逻辑运算符优先级:And和Or
- 如何检查一个表是否存在于给定的模式中
- 添加一个复合主键
- 如何在SQL Server Management Studio中查看查询历史
- 生成具有给定(数值)分布的随机数
- 可以为公共表表达式创建嵌套WITH子句吗?
- 什么时候我需要在Oracle SQL中使用分号vs斜杠?
- SQL Server的NOW()?
- 在SQL中,count(列)和count(*)之间的区别是什么?