I have a SQL Server table with about 50,000 rows in it. I want to randomly select about 5,000 of those rows. I've thought of a complicated way: create a temp table with a "random number" column, copy my table into it, loop through the temp table and update each row with RAND(), and then select from that table where the random number column is < 0.1. I'm looking for a simpler way to do it, in a single statement if possible.

This article recommends using the NEWID() function. That looks promising, but I can't see how I could reliably select a given percentage of rows.

Has anybody ever done this before? Any ideas?
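For reference, the multi-step approach described in the question can be collapsed into three statements with no cursor loop, since the per-row random value can come from a seeded RAND(); a minimal sketch, assuming a hypothetical dbo.MyTable:

SELECT *, CAST(0.0 AS float) AS RandomNumber
INTO #Sample
FROM dbo.MyTable

-- RAND() with no seed is evaluated once per query, so seed it per row via NEWID()
UPDATE #Sample SET RandomNumber = RAND(CHECKSUM(NEWID()))

SELECT * FROM #Sample WHERE RandomNumber < 0.1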
select top 10 percent * from [yourtable] order by newid()
In answer to the "pure trash" comment concerning large tables: you could do it like this to improve performance.
select * from [yourtable] where [yourPk] in
(select top 10 percent [yourPk] from [yourtable] order by newid())
The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.
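Since the question asks for roughly 5,000 rows rather than a percentage, the same pattern also works with a fixed TOP - a sketch, reusing the hypothetical [yourtable] and key [yourPk]:

select * from [yourtable] where [yourPk] in
(select top 5000 [yourPk] from [yourtable] order by newid())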
In MySQL you can do this:
SELECT `PRIMARY_KEY`, rand() FROM table ORDER BY rand() LIMIT 5000;
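If you only need the rows themselves, the random value does not have to appear in the select list; a minimal variant (still MySQL syntax, table name assumed) is:

SELECT * FROM table ORDER BY rand() LIMIT 5000;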
Just order the table by a random number and obtain the first 5,000 rows using TOP.
SELECT TOP 5000 * FROM [Table] ORDER BY newid();
UPDATE

Just tried it, and a single newid() call is enough - no need for all the casts and all the math.
Depending on your needs, TABLESAMPLE will get you a sample that is nearly as random, with better performance. This is available on MS SQL Server 2005 and later.

TABLESAMPLE will return data from random pages instead of random rows, and therefore doesn't even retrieve data that it won't return.

On a very large table I tested:
select top 1 percent * from [tablename] order by newid()
took more than 20 minutes.
select * from [tablename] tablesample(1 percent)
took 2 minutes.

Performance on smaller samples also improves with TABLESAMPLE, whereas it does not with newid().

Please keep in mind that this is not as random as the newid() method, but will give you a decent sampling.
See the MSDN page.
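TABLESAMPLE also accepts an absolute row count, and a REPEATABLE clause if you need the same (approximate) sample back on each run - a sketch, with the table name and seed value assumed:

select * from [tablename] tablesample(5000 rows) repeatable(42)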
newid()/order by will work, but will be very expensive for large result sets, because it has to generate an id for every row and then sort them.

TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).

For a better-performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
              / CAST(0x7fffffff AS int)

The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
When run against a table with 1,000,000 rows, here are my results:
SET STATISTICS TIME ON
SET STATISTICS IO ON
/* newid()
rows returned: 10000
logical reads: 3359
CPU time: 3312 ms
elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()
/* TABLESAMPLE
rows returned: 9269 (varies)
logical reads: 32
CPU time: 0 ms
elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)
/* Filter
rows returned: 9994 (varies)
logical reads: 3359
CPU time: 641 ms
elapsed time: 627 ms
*/
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be the last resort if you have a large result set.
Selecting Rows Randomly from a Large Table on MSDN has a simple, well-articulated solution that addresses the large-scale performance concerns.
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
If you (unlike the OP) need a specific number of records (which makes the CHECKSUM approach difficult) and desire a more random sample than TABLESAMPLE provides by itself, and also want better speed than CHECKSUM, you may make do with a merger of the TABLESAMPLE and NEWID() methods, like this:
DECLARE @sampleCount int = 50
SET STATISTICS TIME ON
SELECT TOP (@sampleCount) *
FROM [yourtable] TABLESAMPLE(10 PERCENT)
ORDER BY NEWID()
SET STATISTICS TIME OFF
In my case this is the most straightforward compromise between randomness (it's not really, I know) and speed. Vary the TABLESAMPLE percentage (or rows) as appropriate - the higher the percentage, the more random the sample, but expect a linear drop-off in speed. (Note that TABLESAMPLE will not accept a variable.)
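Since TABLESAMPLE will not accept a variable, one workaround is to splice the percentage into dynamic SQL - a sketch, with the table name assumed:

DECLARE @sampleCount int = 50
DECLARE @tablesamplePct int = 10  -- must end up in the statement as a literal
DECLARE @sql nvarchar(max) =
    N'SELECT TOP (@n) * FROM [yourtable] TABLESAMPLE(' +
    CONVERT(nvarchar(10), @tablesamplePct) + N' PERCENT) ORDER BY NEWID()'
EXEC sp_executesql @sql, N'@n int', @n = @sampleCount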
I didn't see this variation in the answers yet. I had an additional constraint where, given an initial seed, I needed to select the same set of rows every time.

For MS SQL:

Minimal example:
select top 10 percent *
from table_name
order by rand(checksum(*))
Normalized execution time: 1.00

NewId() example:
select top 10 percent *
from table_name
order by newid()
Normalized execution time: 1.02

NewId() is only marginally slower than rand(checksum(*)), so you may not want to use it against large record sets.

Selection with initial seed:
declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */
select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */
If you need to select the same set given a seed, this seems to work.
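For example, pinning the seed to a constant should return the same 10 percent on every execution - a sketch against the same assumed table_name:

declare @seed int
set @seed = 20190101 /* any fixed constant */
select top 10 percent *
from table_name
order by rand(checksum(*) % @seed)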
This link has an interesting comparison between ORDER BY NEWID() and other methods for tables with 1, 7, and 13 million rows.

Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables.
SELECT TOP 10 PERCENT *
FROM Table1
ORDER BY NEWID()
However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:

1. The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.
2. In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command.

What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.

Please read the full article on MSDN.
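One way to sanity-check the threshold is to compare the sampled count against the total - a sketch against the same Table1:

SELECT
    (SELECT COUNT(*) FROM Table1
     WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 10) AS sampled_rows,
    (SELECT COUNT(*) FROM Table1) AS total_rows -- expect roughly a 1:10 ratio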
It seems newid() can't be used inside a where clause, so this solution requires an inner query:
SELECT *
FROM (
SELECT *, ABS(CHECKSUM(NEWID())) AS Rnd
FROM MyTable
) vw
WHERE Rnd % 100 < 10 --10%
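If you need an exact row count rather than an approximate percentage, the same derived table can be capped with TOP instead of the modulus filter - a sketch, reusing the hypothetical MyTable (note that the sort reintroduces the ORDER BY NEWID() cost):

SELECT TOP (5000) *
FROM (
    SELECT *, ABS(CHECKSUM(NEWID())) AS Rnd
    FROM MyTable
) vw
ORDER BY Rnd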
This is a combination of the initial seed idea and a checksum, which as far as I can tell gives properly random results without the cost of NEWID():
SELECT TOP [number] *
FROM table_name
ORDER BY RAND(CHECKSUM(*) * RAND())
I used it in a subquery, and it returned the same rows within the subquery:
SELECT ID ,
( SELECT TOP 1
ImageURL
FROM SubTable
ORDER BY NEWID()
) AS ImageURL,
GETUTCDATE() ,
1
FROM Mytable
Then I solved it by including a parent-table reference in the where clause:
SELECT ID ,
( SELECT TOP 1
ImageURL
FROM SubTable
Where Mytable.ID>0
ORDER BY NEWID()
) AS ImageURL,
GETUTCDATE() ,
1
FROM Mytable
Note the where condition.
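A commonly used alternative for this kind of per-parent random pick is CROSS APPLY, which makes the outer-row dependency explicit - a sketch, reusing the same assumed tables (use OUTER APPLY if parents without a SubTable row should still appear, with a NULL ImageURL):

SELECT m.ID ,
       x.ImageURL ,
       GETUTCDATE() ,
       1
FROM Mytable m
CROSS APPLY ( SELECT TOP 1
                     ImageURL
              FROM SubTable
              Where m.ID > 0 -- outer reference forces per-row evaluation
              ORDER BY NEWID()
            ) x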
The server-side processing language in use (e.g. PHP, .NET, etc.) isn't specified, but if it's PHP, grab the required number (or all the records), and instead of randomizing in the query, use PHP's shuffle function. I don't know whether .NET has an equivalent function, but if it does, use that if you're using .NET.

ORDER BY RAND() can have quite a performance penalty, depending on how many records are involved.
select * from table
where id in (
select id from table
order by random()
limit ((select count(*) from table)*55/100))
-- to select 55 percent of rows randomly
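That syntax (random(), LIMIT) is PostgreSQL/SQLite; a rough T-SQL equivalent of the same idea, with the table and id column assumed, is:

select * from [table]
where id in (
    select top 55 percent id from [table]
    order by newid()
)
-- to select 55 percent of rows randomly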
This is an updated and improved form of sampling. It is based on the same concept as some other answers that use CHECKSUM / BINARY_CHECKSUM and modulus.

Reasons to use an implementation similar to this one, as opposed to the other answers:
- It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.
- Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution.
- Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions. Approaches that use NEWID(), such as CHECKSUM(NEWID()) % 100, can never be stable/repeatable.
- Allows for increased sample precision and reduces introduced statistical errors. The sampling precision can also be tweaked. CHECKSUM only returns an int value.
- Does not use ORDER BY NEWID(), as ordering can become a significant bottleneck with large input sets. Avoiding the sorting also reduces memory and tempdb usage.
- Does not use TABLESAMPLE and thus works with a WHERE pre-filter.
Cons / limitations:
- Slightly slower execution times than using CHECKSUM(*). Using HASHBYTES, as shown below, adds about 3/4 of a second of overhead per million rows. This is with my data, on my database instance: YMMV. This overhead can be eliminated if using a persisted computed column of the resulting 'well distributed' bigint value from HASHBYTES.
- Unlike the basic SELECT TOP n .. ORDER BY NEWID(), this is not guaranteed to return "exactly N" rows. Instead, it returns a percentage of rows where that percentage is pre-determined. For very small sample sizes this could result in 0 rows selected. This limitation is shared with the CHECKSUM(*) approaches.
Here is the gist:
-- Allow a sampling precision [0, 100.0000].
declare @sample_percent decimal(7, 4) = 12.3456
select
t.*
from t
where 1=1
and t.Name = 'Mr. No Questionable Checksum Usages'
and ( -- sample
@sample_percent = 100
or abs(
-- Choose appropriate identity column(s) for hashbytes input.
-- For demonstration it is assumed to be a UNIQUEIDENTIFIER rowguid column.
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
Notes:
- While SHA1 is technically deprecated since SQL Server 2016, it is both sufficient for the task and is slightly faster than either MD5 or SHA2_256. Use a different hashing function as relevant. If the table already contains a hashed column (with a good distribution), that could potentially be used as well.
- Conversion of bigint is critical as it allows 2^63 bits of 'random space' to which to apply the modulus operator; this is much more than the 2^31 range from the CHECKSUM result. This reduces the modulus error at the limit, especially as the precision is increased.
- The sampling precision can be changed as long as the modulus operand and sample percent are multiplied appropriately. In this case, that is 1000 * to account for the 4 digits of precision allowed in @sample_percent.
- Can multiply the bigint value by RAND() to return a different row sample each run. This effectively changes the permutation of the fixed hash values.
- If @sample_percent is 100, the query planner can eliminate the slower calculation code entirely. Remember 'parameter sniffing' rules. This allows the code to be left in the query regardless of enabling sampling.
And calculating @sample_percent, with lower/upper bounds, and adding a TOP "hint" in the query, may be useful when the sample is used in a derived-table context:
-- Approximate max-sample and min-sample ranges.
-- The minimum sample percent should be non-zero within the precision.
declare @max_sample_size int = 3333333
declare @min_sample_percent decimal(7,4) = 0.3333
declare @sample_percent decimal(7,4) -- [0, 100.0000]
declare @sample_size int
-- Get initial count for determining sample percentages.
-- Remember to match the filter conditions with the usage site!
declare @rows int
select @rows = count(1)
from t
where 1=1
and t.Name = 'Mr. No Questionable Checksum Usages'
-- Calculate sample percent and back-calculate actual sample size.
if @rows <= @max_sample_size begin
set @sample_percent = 100
end else begin
set @sample_percent = convert(float, 100) * @max_sample_size / @rows
if @sample_percent < @min_sample_percent
set @sample_percent = @min_sample_percent
end
set @sample_size = ceiling(@rows * @sample_percent / 100)
select *
from ..
join (
-- Not a precise value: if limiting exactly at, can introduce more bias.
-- Using 'option optimize for' avoids this while requiring dynamic SQL.
select top (@sample_size + convert(int, @sample_percent + 5)) t.*
from t
where 1=1
and t.Name = 'Mr. No Questionable Checksum Usages'
and ( -- sample
@sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
) sampled
on ..
If you know you have approximately N rows and want approximately K random rows, you just need to pull any given row with probability K/N. Using the RAND() function, which gives you a fair distribution between 0 and 1, you can just do the following, where PROB = K/N. This worked very quickly for me.
SELECT * FROM some_table WHERE RAND() < PROB
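One caveat: in SQL Server, RAND() is evaluated once per query rather than once per row, so the statement above returns either all rows or none in T-SQL (it behaves as described in databases such as MySQL, where RAND() is per-row). A per-row T-SQL variant for PROB = 0.1, sketched with the same assumed table:

SELECT * FROM some_table
WHERE ABS(CHECKSUM(NEWID()) % 10000) / 10000.0 < 0.1 -- PROB = 0.1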