我如何才能最好地编写一个查询,从总共600k中随机选择10行?
当前回答
这里有一个改变游戏规则的方法,可能对许多人有帮助;
我有一个有200k行的表,有连续的id,我需要选择N个随机行,所以我选择根据表中最大的id生成随机值,我创建了这个脚本来找出哪个是最快的操作:
logTime();
query("SELECT COUNT(id) FROM tbl");
logTime();
query("SELECT MAX(id) FROM tbl");
logTime();
query("SELECT id FROM tbl ORDER BY id DESC LIMIT 1");
logTime();
结果如下:
计数:36.8418693542479毫秒 Max: 0.241041183472 ms 订单:0.216960906982毫秒
根据这个结果,order desc是得到最大id的最快操作, 以下是我对这个问题的回答:
SELECT GROUP_CONCAT(n SEPARATOR ',') g FROM (
SELECT FLOOR(RAND() * (
SELECT id FROM tbl ORDER BY id DESC LIMIT 1
)) n FROM tbl LIMIT 10) a
...
SELECT * FROM tbl WHERE id IN ($result);
供您参考:从一个200k表中随机获得10行,我花了1.78 ms(包括php方面的所有操作)
其他回答
简单的查询,具有出色的性能和工作的差距:
SELECT * FROM tbl AS t1 JOIN (SELECT id FROM tbl ORDER BY RAND() LIMIT 10) as t2 ON t1.id=t2.id
在一个200K表上的这个查询需要0.08秒,而在我的机器上,正常版本(SELECT * FROM tbl ORDER BY RAND() LIMIT 10)需要0.35秒。
这是快速的,因为排序阶段只使用索引ID列。你可以在解释中看到这种行为:
SELECT * FROM tbl ORDER BY RAND() LIMIT 10:
SELECT * FROM tbl AS t1 JOIN (SELECT id FROM tbl ORDER BY RAND() LIMIT 10) AS t2 ON t1.id=t2.id
加权版:https://stackoverflow.com/a/41577458/893432
所有最好的答案都已经贴出来了(主要是那些引用了http://jan.kneschke.de/projects/mysql/order-by-rand/的链接)。
I want to pinpoint another speed-up possibility - caching. Think of why you need to get random rows. Probably you want display some random post or random ad on a website. If you are getting 100 req/s, is it really needed that each visitor gets random rows? Usually it is completely fine to cache these X random rows for 1 second (or even 10 seconds). It doesn't matter if 100 unique visitors in the same 1 second get the same random posts, because the next second another 100 visitors will get different set of posts.
当使用这种缓存时,你也可以使用一些较慢的解决方案来获取随机数据,因为不管你的req/s如何,它每秒只会从MySQL中获取一次。
如果你只有一个读请求
将@redsio的答案与一个临时表结合起来(600K并不是很多):
DROP TEMPORARY TABLE IF EXISTS tmp_randorder;
CREATE TABLE tmp_randorder (id int(11) not null auto_increment primary key, data_id int(11));
INSERT INTO tmp_randorder (data_id) select id from datatable;
然后用一个@redsios的版本回答:
SELECT dt.*
FROM
(SELECT (RAND() *
(SELECT MAX(id)
FROM tmp_randorder)) AS id)
AS rnd
INNER JOIN tmp_randorder rndo on rndo.id between rnd.id - 10 and rnd.id + 10
INNER JOIN datatable AS dt on dt.id = rndo.data_id
ORDER BY abs(rndo.id - rnd.id)
LIMIT 1;
如果表比较大,可以先筛选第一部分:
INSERT INTO tmp_randorder (data_id) select id from datatable where rand() < 0.01;
如果你有很多读请求
Version: You could keep the table tmp_randorder persistent, call it datatable_idlist. Recreate that table in certain intervals (day, hour), since it also will get holes. If your table gets really big, you could also refill holes select l.data_id as whole from datatable_idlist l left join datatable dt on dt.id = l.data_id where dt.id is null; Version: Give your Dataset a random_sortorder column either directly in datatable or in a persistent extra table datatable_sortorder. Index that column. Generate a Random-Value in your Application (I'll call it $rand). select l.* from datatable l order by abs(random_sortorder - $rand) desc limit 1;
这个解决方案用最高和最低的random_sortorder来区分“边缘行”,所以在间隔中重新排列它们(一天一次)。
我得到了快速查询(大约0.5秒),cpu很慢,在一个400K寄存器的MySQL数据库中随机选择10行,非缓存2Gb大小。在MySQL中快速选择随机行
$time= microtime_float();
$sql='SELECT COUNT(*) FROM pages';
$rquery= BD_Ejecutar($sql);
list($num_records)=mysql_fetch_row($rquery);
mysql_free_result($rquery);
$sql="SELECT id FROM pages WHERE RAND()*$num_records<20
ORDER BY RAND() LIMIT 0,10";
$rquery= BD_Ejecutar($sql);
while(list($id)=mysql_fetch_row($rquery)){
if($id_in) $id_in.=",$id";
else $id_in="$id";
}
mysql_free_result($rquery);
$sql="SELECT id,url FROM pages WHERE id IN($id_in)";
$rquery= BD_Ejecutar($sql);
while(list($id,$url)=mysql_fetch_row($rquery)){
logger("$id, $url",1);
}
mysql_free_result($rquery);
$time= microtime_float()-$time;
logger("num_records=$num_records",1);
logger("$id_in",1);
logger("Time elapsed: <b>$time segundos</b>",1);
我需要一个查询从一个相当大的表中返回大量随机行。这是我想到的。首先获取最大记录id:
SELECT MAX(id) FROM table_name;
然后将该值代入:
SELECT * FROM table_name WHERE id > FLOOR(RAND() * max) LIMIT n;
Where max is the maximum record id in the table and n is the number of rows you want in your result set. The assumption is that there are no gaps in the record id's although I doubt it would affect the result if there were (haven't tried it though). I also created this stored procedure to be more generic; pass in the table name and number of rows to be returned. I'm running MySQL 5.5.38 on Windows 2008, 32GB, dual 3GHz E5450, and on a table with 17,361,264 rows it's fairly consistent at ~.03 sec / ~11 sec to return 1,000,000 rows. (times are from MySQL Workbench 6.1; you could also use CEIL instead of FLOOR in the 2nd select statement depending on your preference)
DELIMITER $$
USE [schema name] $$
DROP PROCEDURE IF EXISTS `random_rows` $$
CREATE PROCEDURE `random_rows`(IN tab_name VARCHAR(64), IN num_rows INT)
BEGIN
SET @t = CONCAT('SET @max=(SELECT MAX(id) FROM ',tab_name,')');
PREPARE stmt FROM @t;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
SET @t = CONCAT(
'SELECT * FROM ',
tab_name,
' WHERE id>FLOOR(RAND()*@max) LIMIT ',
num_rows);
PREPARE stmt FROM @t;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END
$$
then
CALL [schema name].random_rows([table name], n);
推荐文章
- 准备好的语句如何防止SQL注入攻击?
- 如何改变一个列的数据类型在PostgreSQL表?
- 在PHP中插入MySQL时转义单引号
- 随机字符串生成器返回相同的字符串
- 如何结合日期从一个字段与时间从另一个字段- MS SQL Server
- 我如何得到“id”后插入到MySQL数据库与Python?
- 如何在不知道其名称的情况下删除SQL默认约束?
- SQL语法区分大小写吗?
- MySQL工作台:如何保持连接活动
- 'create_date'时间戳字段的默认值无效
- 左连接与Where子句
- 如何使用实体框架只更新一个字段?
- 在表变量上创建索引
- 为什么历史上人们使用255而不是256作为数据库字段大小?
- 如何选择记录从过去24小时使用SQL?