How can I best write a query that selects 10 rows at random from a total of 600k?


Current answer

If you have just one read request

Combine @redsio's answer with a temp table (600K is not that much):

DROP TABLE IF EXISTS tmp_randorder;
CREATE TABLE tmp_randorder (id int(11) not null auto_increment primary key, data_id int(11));
INSERT INTO tmp_randorder (data_id) select id from datatable;

Then use a version of @redsio's answer:

SELECT dt.*
FROM
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM tmp_randorder)) AS id)
        AS rnd
 INNER JOIN tmp_randorder rndo on rndo.id between rnd.id - 10 and rnd.id + 10
 INNER JOIN datatable AS dt on dt.id = rndo.data_id
 ORDER BY abs(rndo.id - rnd.id)
 LIMIT 1;
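
To illustrate why the helper table helps, here is a minimal Python sketch of the idea (not the SQL above, which picks the row nearest to a random float): renumbering the real ids into dense positions 1..N means a uniform pick over positions is a uniform pick over rows, whatever gaps the original ids have. The names build_randorder and pick_one and the sample ids are hypothetical.

```python
import random

def build_randorder(data_ids):
    """Mimic tmp_randorder: map dense positions 1..N to the real ids."""
    return {pos: data_id for pos, data_id in enumerate(data_ids, start=1)}

def pick_one(randorder, rng=random):
    """Pick a dense position uniformly, then look up the real id."""
    pos = rng.randint(1, len(randorder))  # randint is inclusive on both ends
    return randorder[pos]

# Hypothetical ids with large gaps, as left behind by deletes.
data_ids = [3, 7, 8, 20, 41]
randorder = build_randorder(data_ids)
print(pick_one(randorder))
```

Because positions are gap-free, no id is favored just because it follows a long run of deleted rows.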

If the table is large, you can pre-filter when filling the helper table:

INSERT INTO tmp_randorder (data_id) select id from datatable where rand() < 0.01;

If you have many read requests

Version 1: You could keep the table tmp_randorder persistent; call it datatable_idlist. Recreate that table at regular intervals (daily, hourly), since it will also accumulate holes. If your table gets really big, you could instead find the holes and refill them:

select l.data_id as hole from datatable_idlist l left join datatable dt on dt.id = l.data_id where dt.id is null;

Version 2: Give your dataset a random_sortorder column, either directly in datatable or in a persistent extra table datatable_sortorder. Index that column. Generate a random value in your application (I'll call it $rand), then select the row whose random_sortorder is closest to it:

select l.* from datatable l order by abs(random_sortorder - $rand) asc limit 1;

This solution discriminates against the "edge rows" with the highest and lowest random_sortorder, so reassign the values at intervals (once a day).

Other answers

A great post that handles several cases, from simple, to ids with gaps, to non-uniform ids with gaps:

http://jan.kneschke.de/projects/mysql/order-by-rand/

For the most general case, here is how you do it:

SELECT name
  FROM random AS r1 JOIN
       (SELECT CEIL(RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1

This assumes the ids are evenly distributed, although the id list may contain gaps. See the article for more advanced examples.
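
To make the even-distribution assumption concrete, the following Python sketch computes the exact selection probability of each row under "first row with id >= r". It assumes r is uniform over 1..MAX(id), which is what CEIL(RAND() * MAX(id)) gives apart from the negligible case RAND() = 0. The function name pick_probabilities and the sample ids are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from fractions import Fraction

def pick_probabilities(ids):
    """Exact probability per id for: first row with id >= r,
    where r is uniform over 1..max(id)."""
    ids = sorted(ids)
    max_id = ids[-1]
    hits = Counter()
    for r in range(1, max_id + 1):
        # bisect_left returns the index of the first id >= r
        hits[ids[bisect_left(ids, r)]] += 1
    return {i: Fraction(hits[i], max_id) for i in ids}

# ids 4, 5 and 6 are missing, so id 7 absorbs their share.
probs = pick_probabilities([1, 2, 3, 7, 8])
for row_id, p in probs.items():
    print(row_id, p)
```

With these sample ids, id 7 is chosen half of the time while every other row gets 1/8, which is why uneven gaps call for the more advanced variants in the article.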

It is a very simple, single-line query.

SELECT * FROM Table_Name ORDER BY RAND() LIMIT 0,10;

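I needed a query to return a large number of random rows from a rather large table. This is what I came up with. First, get the maximum record id: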

SELECT MAX(id) FROM table_name;

Then substitute that value into:

SELECT * FROM table_name WHERE id > FLOOR(RAND() * max) LIMIT n;

Here max is the maximum record id in the table and n is the number of rows you want in your result set. The assumption is that there are no gaps in the record ids, although I doubt gaps would affect the result (I haven't tried it, though).

I'm running MySQL 5.5.38 on Windows 2008, 32GB, dual 3GHz E5450, and on a table with 17,361,264 rows it is fairly consistent at ~.03 sec / ~11 sec to return 1,000,000 rows (times are from MySQL Workbench 6.1). You could also use CEIL instead of FLOOR in the second SELECT statement, depending on your preference.

I also created this stored procedure to be more generic; pass in the table name and the number of rows to be returned:

DELIMITER $$

USE [schema name] $$

DROP PROCEDURE IF EXISTS `random_rows` $$

CREATE PROCEDURE `random_rows`(IN tab_name VARCHAR(64), IN num_rows INT)
BEGIN

SET @t = CONCAT('SET @max=(SELECT MAX(id) FROM ',tab_name,')');
PREPARE stmt FROM @t;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

SET @t = CONCAT(
    'SELECT * FROM ',
    tab_name,
    ' WHERE id>FLOOR(RAND()*@max) LIMIT ',
    num_rows);

PREPARE stmt FROM @t;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END
$$

Then call it:

CALL [schema name].random_rows([table name], n);
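
For intuition about what WHERE id > FLOOR(RAND() * max) LIMIT n returns, here is a small Python sketch. It assumes the rows come back in id order (the SQL has no ORDER BY, so MySQL does not guarantee that) and uses u as a stand-in for RAND(). The function name rows_after_random_start is hypothetical.

```python
import math

def rows_after_random_start(ids, n, u):
    """Mimic WHERE id > FLOOR(u * max) LIMIT n, assuming id order."""
    max_id = max(ids)
    start = math.floor(u * max_id)
    return [i for i in sorted(ids) if i > start][:n]

ids = list(range(1, 101))
print(rows_after_random_start(ids, 5, 0.5))         # ids 51 through 55
print(len(rows_after_random_start(ids, 30, 0.75)))  # only 25 rows remain past id 75
```

Under that assumption the result is one consecutive run of rows starting at a random point, and a start point near the end of the table yields fewer than n rows.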

You can easily use a random offset with a limit:

PREPARE stm from 'select * from table limit 10 offset ?';
SET @total = (select count(*) from table);
SET @_offset = FLOOR(RAND() * @total);
EXECUTE stm using @_offset;

You can also apply a where clause like so:

PREPARE stm from 'select * from table where available=true limit 10 offset ?';
SET @total = (select count(*) from table where available=true);
SET @_offset = FLOOR(RAND() * @total);
EXECUTE stm using @_offset;

Tested on a 600,000-row (700MB) table, query execution took about 0.016 seconds on a hard disk drive.

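EDIT: The offset might take a value close to the end of the table, which will cause the select statement to return fewer rows (or maybe only one row). To avoid this, we can check the offset again after setting it, like so: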

SET @rows_count = 10;
PREPARE stm from "select * from table where available=true limit ? offset ?";
SET @total = (select count(*) from table where available=true);
SET @_offset = FLOOR(RAND() * @total);
SET @_offset = (SELECT IF(@total-@_offset<@rows_count,@_offset-@rows_count,@_offset));
SET @_offset = (SELECT IF(@_offset<0,0,@_offset));
EXECUTE stm using @rows_count,@_offset;
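
The two IF() adjustments can be checked in isolation. This Python sketch mirrors them, with u standing in for RAND(); the function names clamp_offset and random_offset are hypothetical.

```python
import math

def clamp_offset(total, raw_offset, rows_count):
    """Mirror the two IF() adjustments from the SQL above."""
    offset = raw_offset
    if total - offset < rows_count:  # too close to the end of the table
        offset -= rows_count
    return max(offset, 0)            # never negative, e.g. for tiny tables

def random_offset(total, rows_count, u):
    """u stands in for RAND(), a float in [0, 1)."""
    return clamp_offset(total, math.floor(u * total), rows_count)

print(clamp_offset(100, 95, 10))  # shifted back to 85
print(clamp_offset(100, 90, 10))  # 90 already leaves room for 10 rows
print(clamp_offset(5, 3, 10))     # table smaller than rows_count: 0
```

For any table with at least rows_count matching rows, the adjusted offset always leaves room for a full window of rows_count rows.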

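The following should be fast, unbiased, and independent of the id column. However, it does not guarantee that the number of rows returned will match the number of rows requested.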

SELECT *
FROM t
WHERE RAND() < (SELECT 10 / COUNT(*) FROM t)

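Explanation: suppose you want 10 rows out of 100. Then each row has a 1/10 probability of being selected, which can be achieved with WHERE RAND() < 0.1. This approach does not guarantee 10 rows, but if the query is run enough times, the average number of rows per execution will be around 10 and every row in the table will be selected equally often.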
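
The claim about the average can be checked with a quick simulation. This Python sketch keeps each row independently with probability wanted / row_count, as the WHERE clause does; the function name bernoulli_sample, the table size of 1000, and the fixed seed are assumptions made only for illustration.

```python
import random

def bernoulli_sample(row_count, wanted, rng):
    """Keep each row independently with probability wanted / row_count."""
    p = wanted / row_count
    return [row for row in range(row_count) if rng.random() < p]

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
sizes = [len(bernoulli_sample(1000, 10, rng)) for _ in range(2000)]
print(min(sizes), max(sizes), sum(sizes) / len(sizes))
```

Individual runs return both fewer and more than 10 rows, while the mean over many runs stays close to 10.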