我如何才能最好地编写一个查询,从总共600k中随机选择10行?


当前回答

我需要一个查询从一个相当大的表中返回大量随机行。这是我想到的。首先获取最大记录id:

SELECT MAX(id) FROM table_name;

然后将该值代入:

SELECT * FROM table_name WHERE id > FLOOR(RAND() * max) LIMIT n;

Where max is the maximum record id in the table and n is the number of rows you want in your result set. The assumption is that there are no gaps in the record id's although I doubt it would affect the result if there were (haven't tried it though). I also created this stored procedure to be more generic; pass in the table name and number of rows to be returned. I'm running MySQL 5.5.38 on Windows 2008, 32GB, dual 3GHz E5450, and on a table with 17,361,264 rows it's fairly consistent at ~.03 sec / ~11 sec to return 1,000,000 rows. (times are from MySQL Workbench 6.1; you could also use CEIL instead of FLOOR in the 2nd select statement depending on your preference)

DELIMITER $$

USE [schema name] $$

DROP PROCEDURE IF EXISTS `random_rows` $$

CREATE PROCEDURE `random_rows`(IN tab_name VARCHAR(64), IN num_rows INT)
BEGIN

SET @t = CONCAT('SET @max=(SELECT MAX(id) FROM ',tab_name,')');
PREPARE stmt FROM @t;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

SET @t = CONCAT(
    'SELECT * FROM ',
    tab_name,
    ' WHERE id>FLOOR(RAND()*@max) LIMIT ',
    num_rows);

PREPARE stmt FROM @t;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END
$$

then

CALL [schema name].random_rows([table name], n);

其他回答

我得到了快速查询(大约0.5秒),cpu很慢,在一个400K寄存器的MySQL数据库中随机选择10行,非缓存2Gb大小。在MySQL中快速选择随机行

$time= microtime_float();

$sql='SELECT COUNT(*) FROM pages';
$rquery= BD_Ejecutar($sql);
list($num_records)=mysql_fetch_row($rquery);
mysql_free_result($rquery);

$sql="SELECT id FROM pages WHERE RAND()*$num_records<20
   ORDER BY RAND() LIMIT 0,10";
$rquery= BD_Ejecutar($sql);
while(list($id)=mysql_fetch_row($rquery)){
    if($id_in) $id_in.=",$id";
    else $id_in="$id";
}
mysql_free_result($rquery);

$sql="SELECT id,url FROM pages WHERE id IN($id_in)";
$rquery= BD_Ejecutar($sql);
while(list($id,$url)=mysql_fetch_row($rquery)){
    logger("$id, $url",1);
}
mysql_free_result($rquery);

$time= microtime_float()-$time;

logger("num_records=$num_records",1);
logger("$id_in",1);
logger("Time elapsed: <b>$time segundos</b>",1);

一个伟大的职位处理几个情况,从简单,到差距,到不均匀与差距。

http://jan.kneschke.de/projects/mysql/order-by-rand/

对于大多数一般情况,你可以这样做:

SELECT name
  FROM random AS r1 JOIN
       (SELECT CEIL(RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1

这假设id的分布是相等的,并且id列表中可能存在间隙。有关更高级的示例,请参阅本文

我知道这不是你想要的,但我将给你的答案是我在一个小网站的生产中使用的。

根据您访问随机值的次数,不值得使用MySQL,因为您将无法缓存答案。我们在那里有一个按钮来访问一个随机页面,用户可以每分钟点击几次,如果他愿意的话。这将导致MySQL的大量使用,至少对我来说,MySQL是优化的最大问题。

我会采用另一种方法,你可以把答案存储在缓存中。调用MySQL:

SELECT min(id) as min, max(id) as max FROM your_table

使用min和max Id,您可以在服务器中计算一个随机数。在python中:

random.randint(min, max)

然后,用你的随机数,你可以在你的表中得到一个随机Id:

SELECT * 
FROM your_table 
WHERE id >= %s 
ORDER BY id ASC 
LIMIT 1

在这种方法中,对数据库执行两次调用,但是可以缓存它们,并且在很长一段时间内不访问数据库,从而提高性能。注意,如果表中有洞,这不是随机的。有超过一行很容易,因为你可以使用python创建Id,并为每行执行一个请求,但由于它们是缓存的,这是可以的。

如果你的表中有太多的洞,你可以尝试同样的方法,但是现在是记录的总数:

SELECT COUNT(*) as total FROM your_table

然后在python中你可以这样写:

random.randint(0, total)

为了获取一个随机结果,你可以使用如下所示的LIMIT:

SELECT * 
FROM your_table 
ORDER BY id ASC 
LIMIT %s, 1

注意它会在X个随机行之后得到1个值。即使您的表中有洞,它也将是完全随机的,但它将为您的数据库带来更多的开销。

SELECT
  * 
FROM
  table_with_600k_rows
WHERE
  RAND( ) 
ORDER BY
  id DESC 
LIMIT 30;

Id是主键,按Id排序, 解释table_with_600k_rows,发现该行不扫描整个表

使用这个查询:

select floor(RAND() * (SELECT MAX(key) FROM table)) from table limit 10

查询时间:0.016秒