我如何才能最好地编写一个查询,从总共600k中随机选择10行?
当前回答
简单的查询,具有出色的性能和工作的差距:
SELECT * FROM tbl AS t1 JOIN (SELECT id FROM tbl ORDER BY RAND() LIMIT 10) as t2 ON t1.id=t2.id
在一个200K表上的这个查询需要0.08秒,而在我的机器上,正常版本(SELECT * FROM tbl ORDER BY RAND() LIMIT 10)需要0.35秒。
这是快速的,因为排序阶段只使用索引ID列。你可以在解释中看到这种行为:
SELECT * FROM tbl ORDER BY RAND() LIMIT 10:
SELECT * FROM tbl AS t1 JOIN (SELECT id FROM tbl ORDER BY RAND() LIMIT 10) AS t2 ON t1.id=t2.id
加权版:https://stackoverflow.com/a/41577458/893432
其他回答
一个伟大的职位处理几个情况,从简单,到差距,到不均匀与差距。
http://jan.kneschke.de/projects/mysql/order-by-rand/
对于大多数一般情况,你可以这样做:
SELECT name
FROM random AS r1 JOIN
(SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1
这假设id的分布是相等的,并且id列表中可能存在间隙。有关更高级的示例,请参阅本文
如果有一个自动生成的id,我发现一个很好的方法是使用模运算符'%'。例如,如果您需要70,000条随机记录中的10,000条,您可以简化为每7行中需要1行。这可以在这个查询中简化:
SELECT * FROM
table
WHERE
id %
FLOOR(
(SELECT count(1) FROM table)
/ 10000
) = 0;
如果目标行除以total available的结果不是一个整数,那么你将得到比你要求的更多的行,所以你应该添加一个LIMIT子句来帮助你像这样修剪结果集:
SELECT * FROM
table
WHERE
id %
FLOOR(
(SELECT count(1) FROM table)
/ 10000
) = 0
LIMIT 10000;
这确实需要一个完整的扫描,但它比ORDER BY RAND更快,在我看来,比本文中提到的其他选项更容易理解。另外,如果写入数据库的系统批量创建了一组行,你可能不会得到你所期望的随机结果。
我改进了@Riedsio的答案。这是我在一个有间隙的大型均匀分布表上能找到的最有效的查询(测试从一个有> 2.6B行的表中获得1000个随机行)。
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max := (SELECT MAX(id) FROM table)) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1)
让我来解释一下发生了什么。
(SELECT MAX(id) FROM table) 我在计算并保存最大值。对于非常大的表,每次需要一行时计算MAX(id)都会有轻微的开销 SELECT FLOOR(rand() * @max) + 1 as rand) 获取一个随机id SELECT id FROM table INNER JOIN(… 这就填补了空白。基本上,如果你在间隙中随机选择一个数字,它就会选择下一个id。假设间隙是均匀分布的,这应该不是问题。
进行联合可以帮助您将所有内容放入一个查询中,从而避免进行多个查询。它还可以节省计算MAX(id)的开销。根据您的应用程序,这可能非常重要,也可能无关紧要。
注意,这只获取id,并以随机顺序获取它们。如果你想做更高级的事情,我建议你这样做:
SELECT t.id, t.name -- etc, etc
FROM table t
INNER JOIN (
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max := (SELECT MAX(id) FROM table)) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1) UNION
(SELECT id FROM table INNER JOIN (SELECT FLOOR(RAND() * @max) + 1 as rand) r on id > rand LIMIT 1)
) x ON x.id = t.id
ORDER BY t.id
如何从表中随机选择行:
从这里开始: 在MySQL中随机选择行
对“表扫描”的快速改进是使用索引来获取随机id。
SELECT *
FROM random, (
SELECT id AS sid
FROM random
ORDER BY RAND( )
LIMIT 10
) tmp
WHERE random.id = tmp.sid;
如果你只有一个读请求
将@redsio的答案与一个临时表结合起来(600K并不是很多):
DROP TEMPORARY TABLE IF EXISTS tmp_randorder;
CREATE TABLE tmp_randorder (id int(11) not null auto_increment primary key, data_id int(11));
INSERT INTO tmp_randorder (data_id) select id from datatable;
然后用一个@redsios的版本回答:
SELECT dt.*
FROM
(SELECT (RAND() *
(SELECT MAX(id)
FROM tmp_randorder)) AS id)
AS rnd
INNER JOIN tmp_randorder rndo on rndo.id between rnd.id - 10 and rnd.id + 10
INNER JOIN datatable AS dt on dt.id = rndo.data_id
ORDER BY abs(rndo.id - rnd.id)
LIMIT 1;
如果表比较大,可以先筛选第一部分:
INSERT INTO tmp_randorder (data_id) select id from datatable where rand() < 0.01;
如果你有很多读请求
Version: You could keep the table tmp_randorder persistent, call it datatable_idlist. Recreate that table in certain intervals (day, hour), since it also will get holes. If your table gets really big, you could also refill holes select l.data_id as whole from datatable_idlist l left join datatable dt on dt.id = l.data_id where dt.id is null; Version: Give your Dataset a random_sortorder column either directly in datatable or in a persistent extra table datatable_sortorder. Index that column. Generate a Random-Value in your Application (I'll call it $rand). select l.* from datatable l order by abs(random_sortorder - $rand) desc limit 1;
这个解决方案用最高和最低的random_sortorder来区分“边缘行”,所以在间隔中重新排列它们(一天一次)。
推荐文章
- 修改表:如何将“允许空”属性从不空改为允许空
- 比较两个SQL Server数据库(模式和数据)的最佳工具是什么?
- 在SQL中,如何在范围中“分组”?
- 选项(RECOMPILE)总是更快;为什么?
- 设置数据库从单用户模式到多用户
- oracle中的RANK()和DENSE_RANK()函数有什么区别?
- 警告用户/local/mysql/data目录不属于mysql用户
- 的类型不能用作索引中的键列
- SQL逻辑运算符优先级:And和Or
- 如何检查一个表是否存在于给定的模式中
- 添加一个复合主键
- 如何在SQL Server Management Studio中查看查询历史
- 无法添加或更新子行:外键约束失败
- 生成具有给定(数值)分布的随机数
- 如何从本地机器mysqldump远程数据库