我如何才能最好地编写一个查询,从总共600k中随机选择10行?


当前回答

SELECT column FROM table
ORDER BY RAND()
LIMIT 10

这不是有效的解决方案,但确实有效

其他回答

这里有一个改变游戏规则的方法,可能对许多人有帮助;

我有一个有200k行的表,有连续的id,我需要选择N个随机行,所以我选择根据表中最大的id生成随机值,我创建了这个脚本来找出哪个是最快的操作:

logTime();
query("SELECT COUNT(id) FROM tbl");
logTime();
query("SELECT MAX(id) FROM tbl");
logTime();
query("SELECT id FROM tbl ORDER BY id DESC LIMIT 1");
logTime();

结果如下:

计数:36.8418693542479毫秒 Max: 0.241041183472 ms 订单:0.216960906982毫秒

根据这个结果,order desc是得到最大id的最快操作, 以下是我对这个问题的回答:

SELECT GROUP_CONCAT(n SEPARATOR ',') g FROM (
    SELECT FLOOR(RAND() * (
        SELECT id FROM tbl ORDER BY id DESC LIMIT 1
    )) n FROM tbl LIMIT 10) a

...
SELECT * FROM tbl WHERE id IN ($result);

供您参考:从一个200k表中随机获得10行,我花了1.78 ms(包括php方面的所有操作)

所有最好的答案都已经贴出来了(主要是那些引用了http://jan.kneschke.de/projects/mysql/order-by-rand/的链接)。

I want to pinpoint another speed-up possibility - caching. Think of why you need to get random rows. Probably you want display some random post or random ad on a website. If you are getting 100 req/s, is it really needed that each visitor gets random rows? Usually it is completely fine to cache these X random rows for 1 second (or even 10 seconds). It doesn't matter if 100 unique visitors in the same 1 second get the same random posts, because the next second another 100 visitors will get different set of posts.

当使用这种缓存时,你也可以使用一些较慢的解决方案来获取随机数据,因为不管你的req/s如何,它每秒只会从MySQL中获取一次。

SELECT
  * 
FROM
  table_with_600k_rows
WHERE
  RAND( ) 
ORDER BY
  id DESC 
LIMIT 30;

Id是主键,按Id排序, 解释table_with_600k_rows,发现该行不扫描整个表

我需要一个查询从一个相当大的表中返回大量随机行。这是我想到的。首先获取最大记录id:

SELECT MAX(id) FROM table_name;

然后将该值代入:

SELECT * FROM table_name WHERE id > FLOOR(RAND() * max) LIMIT n;

Where max is the maximum record id in the table and n is the number of rows you want in your result set. The assumption is that there are no gaps in the record id's although I doubt it would affect the result if there were (haven't tried it though). I also created this stored procedure to be more generic; pass in the table name and number of rows to be returned. I'm running MySQL 5.5.38 on Windows 2008, 32GB, dual 3GHz E5450, and on a table with 17,361,264 rows it's fairly consistent at ~.03 sec / ~11 sec to return 1,000,000 rows. (times are from MySQL Workbench 6.1; you could also use CEIL instead of FLOOR in the 2nd select statement depending on your preference)

DELIMITER $$

USE [schema name] $$

DROP PROCEDURE IF EXISTS `random_rows` $$

CREATE PROCEDURE `random_rows`(IN tab_name VARCHAR(64), IN num_rows INT)
BEGIN

SET @t = CONCAT('SET @max=(SELECT MAX(id) FROM ',tab_name,')');
PREPARE stmt FROM @t;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

SET @t = CONCAT(
    'SELECT * FROM ',
    tab_name,
    ' WHERE id>FLOOR(RAND()*@max) LIMIT ',
    num_rows);

PREPARE stmt FROM @t;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END
$$

then

CALL [schema name].random_rows([table name], n);

如果有一个自动生成的id,我发现一个很好的方法是使用模运算符'%'。例如,如果您需要70,000条随机记录中的10,000条,您可以简化为每7行中需要1行。这可以在这个查询中简化:

SELECT * FROM 
    table 
WHERE 
    id % 
    FLOOR(
        (SELECT count(1) FROM table) 
        / 10000
    ) = 0;

如果目标行除以total available的结果不是一个整数,那么你将得到比你要求的更多的行,所以你应该添加一个LIMIT子句来帮助你像这样修剪结果集:

SELECT * FROM 
    table 
WHERE 
    id % 
    FLOOR(
        (SELECT count(1) FROM table) 
        / 10000
    ) = 0
LIMIT 10000;

这确实需要一个完整的扫描,但它比ORDER BY RAND更快,在我看来,比本文中提到的其他选项更容易理解。另外,如果写入数据库的系统批量创建了一组行,你可能不会得到你所期望的随机结果。