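I have a SQL Server table with about 50,000 rows in it. I want to select about 5,000 of those rows at random. I've thought of a complicated way: create a temp table with a "random number" column, copy my table into it, loop through the temp table and update each row with RAND(), and then select from that table where the random number column < 0.1. I'm looking for a simpler way to do it, in a single statement if possible.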
select top 10 percent *
from table_name
order by rand(checksum(*))
select top 10 percent *
from table_name
order by newid()
declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */
select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */
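The server-side processing language in use (PHP, .NET, etc.) isn't specified, but if it's PHP, fetch the required number of records (or all of them) and, instead of randomizing in the query, use PHP's shuffle function. I don't know whether .NET has an equivalent function, but if it does, use that if you're on .NET.
ORDER BY RAND() can carry quite a performance penalty, depending on how many records are involved.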
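If the sort itself is the concern, one option inside SQL Server is to filter on a per-row random value instead of ordering by one. The following is a minimal sketch, assuming a placeholder table named table_name. NEWID() is evaluated for every row, so each row is kept independently with a probability of roughly 10 percent, and the number of rows returned is only approximately 10 percent of the table.

```sql
-- table_name is a placeholder. No ORDER BY is needed, so the table is not sorted.
-- The bitwise AND clears the sign bit, which avoids calling ABS() on a negative int.
select *
from table_name
where (checksum(newid()) & 0x7fffffff) % 100 < 10
```

Because NEWID() produces new values on every execution, this returns a different set of rows each time it runs.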
Depending on your needs, TABLESAMPLE will get you results that are nearly as random, with better performance. It is available on MS SQL Server 2005 and later.
select top 1 percent * from [tablename] order by newid()
select * from [tablename] tablesample(1 percent)
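TABLESAMPLE picks whole data pages rather than individual rows, so the percentage is approximate and small tables can return far more or far fewer rows than expected. When the same sample is needed on repeated runs, the REPEATABLE option can be added. A short sketch, reusing the [tablename] placeholder from above with an arbitrary seed of 42:

```sql
-- [tablename] is a placeholder; 42 is an arbitrary seed value.
-- With the same seed, the same rows come back as long as the table data has not changed.
select *
from [tablename] tablesample (10 percent) repeatable (42)
```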
This is an updated and improved form of sampling. It is based on the same concept as some other answers that use CHECKSUM / BINARY_CHECKSUM and modulus.
It is relatively fast over huge data sets and can be used efficiently in or with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.
It does not suffer from the CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. With the CHECKSUM(*) approach, rows can be selected in "chunks" and not be "random" at all, because CHECKSUM prefers speed over distribution.
It results in a stable, repeatable row selection and can be trivially changed to produce different rows on subsequent query executions. Approaches that use NEWID(), such as CHECKSUM(NEWID()) % 100, can never be stable or repeatable.
It allows for increased sample precision and reduces introduced statistical errors, and the sampling precision can be tweaked. CHECKSUM only returns an int value.
It does not use ORDER BY NEWID(), as ordering can become a significant bottleneck with large input sets. Avoiding the sort also reduces memory and tempdb usage.
It does not use TABLESAMPLE and thus works with a WHERE pre-filter.
On the downside, execution is slightly slower than with CHECKSUM(*). Using HASHBYTES, as shown below, adds about 3/4 of a second of overhead per million rows. This is with my data, on my database instance: YMMV. This overhead can be eliminated by using a persisted computed column holding the resulting 'well distributed' bigint value from HASHBYTES.
Unlike the basic SELECT TOP n .. ORDER BY NEWID(), this is not guaranteed to return "exactly N" rows. Instead, it returns a percentage of rows, where that percentage is pre-determined. For very small sample sizes this could result in 0 rows selected. This limitation is shared with the CHECKSUM(*) approaches.
-- Allow a sampling precision [0, 100.0000].
declare @sample_percent decimal(7, 4) = 12.3456

select
    t.*
from t
where 1=1
    and t.Name = 'Mr. No Questionable Checksum Usages'
    and ( -- sample
        @sample_percent = 100
        or abs(
            -- Choose appropriate identity column(s) for hashbytes input.
            -- For demonstration it is assumed to be a UNIQUEIDENTIFIER rowguid column.
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
While SHA1 has technically been deprecated since SQL Server 2016, it is sufficient for the task and slightly faster than either MD5 or SHA2_256. Use a different hashing function as relevant. If the table already contains a hashed column (with a good distribution), that could potentially be used as well.
The conversion to bigint is critical, as it allows 2^63 values of 'random space' to which to apply the modulus operator; this is much more than the 2^31 range of the CHECKSUM result. This reduces the modulus error at the limit, especially as the precision is increased.
The sampling precision can be changed as long as the modulus operand and the sample percent are multiplied appropriately. In this case, that is 1000 * to account for the 4 digits of precision allowed in @sample_percent.
The bigint value can be multiplied by RAND() to return a different row sample each run. This effectively changes the permutation of the fixed hash values.
If @sample_percent is 100, the query planner can eliminate the slower calculation code entirely. Remember the 'parameter sniffing' rules. This allows the code to be left in the query regardless of whether sampling is enabled.
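The persisted computed column mentioned among the trade-offs could look roughly like the sketch below. It assumes the same table t with a UNIQUEIDENTIFIER rowguid column used in the demonstration query; the column name sample_hash is hypothetical, and the filter on t.Name is left out for brevity.

```sql
-- Assumption: t has a uniqueidentifier column named rowguid.
-- sample_hash is a hypothetical name for the stored hash value.
alter table t
    add sample_hash as
        convert(bigint, hashbytes('SHA1', convert(varbinary(32), rowguid))) persisted
go  -- batch separator (SSMS / sqlcmd), so the next batch can see the new column

declare @sample_percent decimal(7, 4) = 12.3456

-- Same sampling predicate as before, reading the stored value instead of hashing each row.
select t.*
from t
where @sample_percent = 100
    or abs(t.sample_hash) % (1000 * 100) < (1000 * @sample_percent)
```

The hash is then computed once when a row is written rather than on every sampling query, at the cost of 8 extra bytes of storage per row.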
-- Approximate max-sample and min-sample ranges.
-- The minimum sample percent should be non-zero within the precision.
declare @max_sample_size int = 3333333
declare @min_sample_percent decimal(7,4) = 0.3333
declare @sample_percent decimal(7,4) -- [0, 100.0000]
declare @sample_size int
-- Get initial count for determining sample percentages.
-- Remember to match the filter conditions with the usage site!
declare @rows int
select @rows = count(1)
from t
where 1=1
and t.Name = 'Mr. No Questionable Checksum Usages'
-- Calculate sample percent and back-calculate actual sample size.
if @rows <= @max_sample_size begin
    set @sample_percent = 100
end else begin
    set @sample_percent = convert(float, 100) * @max_sample_size / @rows
    if @sample_percent < @min_sample_percent
        set @sample_percent = @min_sample_percent
end
set @sample_size = ceiling(@rows * @sample_percent / 100)
select *
from ..
join (
    -- Not a precise value: if limiting exactly at, can introduce more bias.
    -- Using 'option optimize for' avoids this while requiring dynamic SQL.
    select top (@sample_size + convert(int, @sample_percent + 5))
        t.*
    from t
    where 1=1
        and t.Name = 'Mr. No Questionable Checksum Usages'
        and ( -- sample
            @sample_percent = 100
            or abs(
                convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
            ) % (1000 * 100) < (1000 * @sample_percent)
        )
) sampled
on ..
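To make the percentage arithmetic concrete, here is a small sketch that runs the same calculation with a hypothetical row count of 10 million in place of the count(1) query. The variable names match the script above; only the @rows value is made up.

```sql
-- Hypothetical input: pretend the filtered count came back as 10 million rows.
declare @rows int = 10000000
declare @max_sample_size int = 3333333
declare @min_sample_percent decimal(7,4) = 0.3333
declare @sample_percent decimal(7,4)
declare @sample_size int

if @rows <= @max_sample_size begin
    set @sample_percent = 100
end else begin
    -- 100 * 3333333 / 10000000 is about 33.33333, stored with 4 decimal places
    set @sample_percent = convert(float, 100) * @max_sample_size / @rows
    if @sample_percent < @min_sample_percent
        set @sample_percent = @min_sample_percent
end
set @sample_size = ceiling(@rows * @sample_percent / 100)

select @sample_percent as sample_percent, @sample_size as sample_size
```

With these inputs the percentage should come out at roughly 33.3333, and the back-calculated sample size lands slightly under @max_sample_size because the percentage is rounded to 4 decimal places.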
select * from table
where id in (
select id from table
order by random()
limit ((select count(*) from table)*55/100))
-- selects 55 percent of the rows at random
-- (random() and limit are not SQL Server syntax)
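If you know you have approximately N rows and you want approximately K random rows, you just need to pull any given row with probability K/N. Using the RAND() function, which gives you a fair distribution between 0 and 1, you can just do the following, where PROB = K/N. It worked very quickly for me.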