我如何才能最好地编写一个查询,从总共600k中随机选择10行?
当前回答
我改进了@Riedsio的答案。这是我在一个有间隙的大型均匀分布表上能找到的最有效的查询(测试从一个有> 2.6B行的表中获得1000个随机行)。
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max := (SELECT MAX(id) FROM table)) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1)
让我来解释一下发生了什么。
(SELECT MAX(id) FROM table) 我在计算并保存最大值。对于非常大的表,每次需要一行时计算MAX(id)都会有轻微的开销 SELECT FLOOR(rand() * @max) + 1 as rand) 获取一个随机id SELECT id FROM table INNER JOIN(… 这就填补了空白。基本上,如果你在间隙中随机选择一个数字,它就会选择下一个id。假设间隙是均匀分布的,这应该不是问题。
进行联合可以帮助您将所有内容放入一个查询中,从而避免进行多个查询。它还可以节省计算MAX(id)的开销。根据您的应用程序,这可能非常重要,也可能无关紧要。
注意,这只获取id,并以随机顺序获取它们。如果你想做更高级的事情,我建议你这样做:
SELECT t.id, t.name -- etc, etc
FROM table t
INNER JOIN (
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max := (SELECT MAX(id) FROM table)) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1)
) x ON x.id = t.id
ORDER BY t.id
其他回答
所有最好的答案都已经贴出来了(主要是那些引用了http://jan.kneschke.de/projects/mysql/order-by-rand/的链接)。
I want to pinpoint another speed-up possibility - caching. Think of why you need to get random rows. Probably you want display some random post or random ad on a website. If you are getting 100 req/s, is it really needed that each visitor gets random rows? Usually it is completely fine to cache these X random rows for 1 second (or even 10 seconds). It doesn't matter if 100 unique visitors in the same 1 second get the same random posts, because the next second another 100 visitors will get different set of posts.
当使用这种缓存时,你也可以使用一些较慢的解决方案来获取随机数据,因为不管你的req/s如何,它每秒只会从MySQL中获取一次。
一个伟大的职位处理几个情况,从简单,到差距,到不均匀与差距。
http://jan.kneschke.de/projects/mysql/order-by-rand/
对于大多数一般情况,你可以这样做:
SELECT name
FROM random AS r1 JOIN
(SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1
这假设id的分布是相等的,并且id列表中可能存在间隙。有关更高级的示例,请参阅本文
使用这个查询:
select floor(RAND() * (SELECT MAX(key) FROM table)) from table limit 10
查询时间:0.016秒
从书中:
使用偏移量选择随机行
这是另一种避免前面提到的问题的技术 替代方法是统计数据集中的行数并返回一个随机值 0到计数之间的数字。然后用这个数字作为抵消 查询数据集时
$rand = "SELECT ROUND(RAND() * (SELECT COUNT(*) FROM Bugs))";
$offset = $pdo->query($rand)->fetch(PDO::FETCH_ASSOC);
$sql = "SELECT * FROM Bugs LIMIT 1 OFFSET :offset";
$stmt = $pdo->prepare($sql);
$stmt->execute( $offset );
$rand_bug = $stmt->fetch();
在不能假定连续键值和时使用此解决方案 您需要确保每一行都有均等的机会被选中。
如果有一个自动生成的id,我发现一个很好的方法是使用模运算符'%'。例如,如果您需要70,000条随机记录中的10,000条,您可以简化为每7行中需要1行。这可以在这个查询中简化:
SELECT * FROM
table
WHERE
id %
FLOOR(
(SELECT count(1) FROM table)
/ 10000
) = 0;
如果目标行除以total available的结果不是一个整数,那么你将得到比你要求的更多的行,所以你应该添加一个LIMIT子句来帮助你像这样修剪结果集:
SELECT * FROM
table
WHERE
id %
FLOOR(
(SELECT count(1) FROM table)
/ 10000
) = 0
LIMIT 10000;
这确实需要一个完整的扫描,但它比ORDER BY RAND更快,在我看来,比本文中提到的其他选项更容易理解。另外,如果写入数据库的系统批量创建了一组行,你可能不会得到你所期望的随机结果。
推荐文章
- 将值从同一表中的一列复制到另一列
- GROUP BY with MAX(DATE)
- 删除id与其他表不匹配的sql行
- 等价的限制和偏移SQL Server?
- MySQL CPU使用率高
- INT和VARCHAR主键之间有真正的性能差异吗?
- 拒绝访问;您需要(至少一个)SUPER特权来执行此操作
- 为什么我不能在DELETE语句中使用别名?
- 在SQL Server Management Studio中保存带有标题的结果
- 从存储引擎得到错误28
- "where 1=1"语句
- 如何选择一个记录和更新它,与一个单一的查询集在Django?
- 多语句表值函数vs内联表值函数
- 如何从Oracle的表中获取列名?
- 可能做MySQL外键的两个可能的表之一?