我如何才能最好地编写一个查询,从总共600k中随机选择10行?


当前回答

我需要一个查询从一个相当大的表中返回大量随机行。这是我想到的。首先获取最大记录id:

SELECT MAX(id) FROM table_name;

然后将该值代入:

SELECT * FROM table_name WHERE id > FLOOR(RAND() * max) LIMIT n;

Where max is the maximum record id in the table and n is the number of rows you want in your result set. The assumption is that there are no gaps in the record id's although I doubt it would affect the result if there were (haven't tried it though). I also created this stored procedure to be more generic; pass in the table name and number of rows to be returned. I'm running MySQL 5.5.38 on Windows 2008, 32GB, dual 3GHz E5450, and on a table with 17,361,264 rows it's fairly consistent at ~.03 sec / ~11 sec to return 1,000,000 rows. (times are from MySQL Workbench 6.1; you could also use CEIL instead of FLOOR in the 2nd select statement depending on your preference)

DELIMITER $$

USE [schema name] $$

DROP PROCEDURE IF EXISTS `random_rows` $$

CREATE PROCEDURE `random_rows`(IN tab_name VARCHAR(64), IN num_rows INT)
BEGIN

SET @t = CONCAT('SET @max=(SELECT MAX(id) FROM ',tab_name,')');
PREPARE stmt FROM @t;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

SET @t = CONCAT(
    'SELECT * FROM ',
    tab_name,
    ' WHERE id>FLOOR(RAND()*@max) LIMIT ',
    num_rows);

PREPARE stmt FROM @t;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END
$$

then

CALL [schema name].random_rows([table name], n);

其他回答

另一个简单的解决方案是对行进行排名,并随机获取其中之一,有了这个解决方案,你将不需要在表中有任何基于“Id”的列。

SELECT d.* FROM (
SELECT  t.*,  @rownum := @rownum + 1 AS rank
FROM mytable AS t,
    (SELECT @rownum := 0) AS r,
    (SELECT @cnt := (SELECT RAND() * (SELECT COUNT(*) FROM mytable))) AS n
) d WHERE rank >= @cnt LIMIT 10;

您可以根据需要更改限制值,以便访问尽可能多的行,但大多数情况下是连续的值。

然而,如果你不想要连续的随机值,那么你可以获取一个更大的样本并从中随机选择。就像……

SELECT * FROM (
SELECT d.* FROM (
    SELECT  c.*,  @rownum := @rownum + 1 AS rank
    FROM buildbrain.`commits` AS c,
        (SELECT @rownum := 0) AS r,
        (SELECT @cnt := (SELECT RAND() * (SELECT COUNT(*) FROM buildbrain.`commits`))) AS rnd
) d 
WHERE rank >= @cnt LIMIT 10000 
) t ORDER BY RAND() LIMIT 10;
SELECT
  * 
FROM
  table_with_600k_rows
WHERE
  RAND( ) 
ORDER BY
  id DESC 
LIMIT 30;

Id是主键,按Id排序, 解释table_with_600k_rows,发现该行不扫描整个表

这非常快,而且是100%随机的,即使你有间隙。

将SELECT Count(*)中可用的行数计算为rows FROM TABLE 选择10个不同的随机数a_1,a_2,…,a_10在0到x之间 SELECT * FROM TABLE LIMIT 1 offset a_i for i=1,…,10

我在Bill Karwin的《SQL反模式》一书中发现了这个破解方法。

我使用了Riedsio发布的http://jan.kneschke.de/projects/mysql/order-by-rand/(我使用了返回一个或多个随机值的存储过程的情况):

   DROP TEMPORARY TABLE IF EXISTS rands;
   CREATE TEMPORARY TABLE rands ( rand_id INT );

    loop_me: LOOP
        IF cnt < 1 THEN
          LEAVE loop_me;
        END IF;

        INSERT INTO rands
           SELECT r1.id
             FROM random AS r1 JOIN
                  (SELECT (RAND() *
                                (SELECT MAX(id)
                                   FROM random)) AS id)
                   AS r2
            WHERE r1.id >= r2.id
            ORDER BY r1.id ASC
            LIMIT 1;

        SET cnt = cnt - 1;
      END LOOP loop_me;

在这篇文章中,他通过维护一个表(使用触发器等)解决了id中的间隙导致不那么随机的结果的问题。参见文章); 我通过向表中添加另一列来解决这个问题,用连续的数字填充,从1开始(编辑:此列添加到运行时由子查询创建的临时表中,不影响永久表):

   DROP TEMPORARY TABLE IF EXISTS rands;
   CREATE TEMPORARY TABLE rands ( rand_id INT );

    loop_me: LOOP
        IF cnt < 1 THEN
          LEAVE loop_me;
        END IF;

        SET @no_gaps_id := 0;

        INSERT INTO rands
           SELECT r1.id
             FROM (SELECT id, @no_gaps_id := @no_gaps_id + 1 AS no_gaps_id FROM random) AS r1 JOIN
                  (SELECT (RAND() *
                                (SELECT COUNT(*)
                                   FROM random)) AS id)
                   AS r2
            WHERE r1.no_gaps_id >= r2.id
            ORDER BY r1.no_gaps_id ASC
            LIMIT 1;

        SET cnt = cnt - 1;
      END LOOP loop_me;

在文章中,我可以看到他花了很大的精力来优化代码;我不知道我的改变是否/有多大影响性能,但对我来说非常好。

这里有一个改变游戏规则的方法,可能对许多人有帮助;

我有一个有200k行的表,有连续的id,我需要选择N个随机行,所以我选择根据表中最大的id生成随机值,我创建了这个脚本来找出哪个是最快的操作:

logTime();
query("SELECT COUNT(id) FROM tbl");
logTime();
query("SELECT MAX(id) FROM tbl");
logTime();
query("SELECT id FROM tbl ORDER BY id DESC LIMIT 1");
logTime();

结果如下:

计数:36.8418693542479毫秒 Max: 0.241041183472 ms 订单:0.216960906982毫秒

根据这个结果,order desc是得到最大id的最快操作, 以下是我对这个问题的回答:

SELECT GROUP_CONCAT(n SEPARATOR ',') g FROM (
    SELECT FLOOR(RAND() * (
        SELECT id FROM tbl ORDER BY id DESC LIMIT 1
    )) n FROM tbl LIMIT 10) a

...
SELECT * FROM tbl WHERE id IN ($result);

供您参考:从一个200k表中随机获得10行,我花了1.78 ms(包括php方面的所有操作)