我如何才能最好地编写一个查询,从总共600k中随机选择10行?


当前回答

以下内容应该是快速的,公正的,独立于id列的。但是,它不能保证返回的行数与请求的行数匹配。

SELECT *
FROM t
WHERE RAND() < (SELECT 10 / COUNT(*) FROM t)

解释:假设你想要100行中的10行,那么每一行都有1/10的概率被选中,这可以通过WHERE RAND() < 0.1来实现。这种方法不能保证有10行;但是如果查询运行了足够多的次数,那么每次执行的平均行数将在10左右,并且表中的每一行都将被均匀地选择。

其他回答

如何从表中随机选择行:

从这里开始: 在MySQL中随机选择行

对“表扫描”的快速改进是使用索引来获取随机id。

SELECT *
FROM random, (
        SELECT id AS sid
        FROM random
        ORDER BY RAND( )
        LIMIT 10
    ) tmp
WHERE random.id = tmp.sid;

所有最好的答案都已经贴出来了(主要是那些引用了http://jan.kneschke.de/projects/mysql/order-by-rand/的链接)。

I want to pinpoint another speed-up possibility - caching. Think of why you need to get random rows. Probably you want display some random post or random ad on a website. If you are getting 100 req/s, is it really needed that each visitor gets random rows? Usually it is completely fine to cache these X random rows for 1 second (or even 10 seconds). It doesn't matter if 100 unique visitors in the same 1 second get the same random posts, because the next second another 100 visitors will get different set of posts.

当使用这种缓存时,你也可以使用一些较慢的解决方案来获取随机数据,因为不管你的req/s如何,它每秒只会从MySQL中获取一次。

我使用了Riedsio发布的http://jan.kneschke.de/projects/mysql/order-by-rand/(我使用了返回一个或多个随机值的存储过程的情况):

   DROP TEMPORARY TABLE IF EXISTS rands;
   CREATE TEMPORARY TABLE rands ( rand_id INT );

    loop_me: LOOP
        IF cnt < 1 THEN
          LEAVE loop_me;
        END IF;

        INSERT INTO rands
           SELECT r1.id
             FROM random AS r1 JOIN
                  (SELECT (RAND() *
                                (SELECT MAX(id)
                                   FROM random)) AS id)
                   AS r2
            WHERE r1.id >= r2.id
            ORDER BY r1.id ASC
            LIMIT 1;

        SET cnt = cnt - 1;
      END LOOP loop_me;

在这篇文章中,他通过维护一个表(使用触发器等)解决了id中的间隙导致不那么随机的结果的问题。参见文章); 我通过向表中添加另一列来解决这个问题,用连续的数字填充,从1开始(编辑:此列添加到运行时由子查询创建的临时表中,不影响永久表):

   DROP TEMPORARY TABLE IF EXISTS rands;
   CREATE TEMPORARY TABLE rands ( rand_id INT );

    loop_me: LOOP
        IF cnt < 1 THEN
          LEAVE loop_me;
        END IF;

        SET @no_gaps_id := 0;

        INSERT INTO rands
           SELECT r1.id
             FROM (SELECT id, @no_gaps_id := @no_gaps_id + 1 AS no_gaps_id FROM random) AS r1 JOIN
                  (SELECT (RAND() *
                                (SELECT COUNT(*)
                                   FROM random)) AS id)
                   AS r2
            WHERE r1.no_gaps_id >= r2.id
            ORDER BY r1.no_gaps_id ASC
            LIMIT 1;

        SET cnt = cnt - 1;
      END LOOP loop_me;

在文章中,我可以看到他花了很大的精力来优化代码;我不知道我的改变是否/有多大影响性能,但对我来说非常好。

我得到了快速查询(大约0.5秒),cpu很慢,在一个400K寄存器的MySQL数据库中随机选择10行,非缓存2Gb大小。在MySQL中快速选择随机行

$time= microtime_float();

$sql='SELECT COUNT(*) FROM pages';
$rquery= BD_Ejecutar($sql);
list($num_records)=mysql_fetch_row($rquery);
mysql_free_result($rquery);

$sql="SELECT id FROM pages WHERE RAND()*$num_records<20
   ORDER BY RAND() LIMIT 0,10";
$rquery= BD_Ejecutar($sql);
while(list($id)=mysql_fetch_row($rquery)){
    if($id_in) $id_in.=",$id";
    else $id_in="$id";
}
mysql_free_result($rquery);

$sql="SELECT id,url FROM pages WHERE id IN($id_in)";
$rquery= BD_Ejecutar($sql);
while(list($id,$url)=mysql_fetch_row($rquery)){
    logger("$id, $url",1);
}
mysql_free_result($rquery);

$time= microtime_float()-$time;

logger("num_records=$num_records",1);
logger("$id_in",1);
logger("Time elapsed: <b>$time segundos</b>",1);

从书中:

使用偏移量选择随机行

这是另一种避免前面提到的问题的技术 替代方法是统计数据集中的行数并返回一个随机值 0到计数之间的数字。然后用这个数字作为抵消 查询数据集时

$rand = "SELECT ROUND(RAND() * (SELECT COUNT(*) FROM Bugs))";
$offset = $pdo->query($rand)->fetch(PDO::FETCH_ASSOC);
$sql = "SELECT * FROM Bugs LIMIT 1 OFFSET :offset";
$stmt = $pdo->prepare($sql);
$stmt->execute( $offset );
$rand_bug = $stmt->fetch();

在不能假定连续键值和时使用此解决方案 您需要确保每一行都有均等的机会被选中。