我想从一个巨大的集合(1亿条记录)中获得一个随机记录。

最快最有效的方法是什么?

数据已经在那里,没有字段可以生成随机数并获得随机行。


当前回答

当我面对类似的解决方案时,我回溯并发现业务请求实际上是为了创建所呈现的库存的某种形式的轮换。在这种情况下,有更好的选择,它们有来自Solr这样的搜索引擎的答案,而不是MongoDB这样的数据存储。

In short, with the requirement to "intelligently rotate" content, what we should do instead of a random number across all of the documents is to include a personal q score modifier. To implement this yourself, assuming a small population of users, you can store a document per user that has the productId, impression count, click-through count, last seen date, and whatever other factors the business finds as being meaningful to compute a q score modifier. When retrieving the set to display, typically you request more documents from the data store than requested by the end user, then apply the q score modifier, take the number of records requested by the end user, then randomize the page of results, a tiny set, so simply sort the documents in the application layer (in memory).

如果用户的范围太大,可以将用户划分为行为组,按行为组而不是按用户进行索引。

如果产品范围足够小,您可以为每个用户创建一个索引。

我发现这种技术效率更高,但更重要的是在创建相关的、有价值的软件解决方案使用体验方面更有效。

其他回答

您还可以在执行查询后使用shuffle-array

Var shuffle = require('shuffle-array');

Accounts.find (qry函数(呃,results_array) { newIndexArr = shuffle (results_array);

为了获得确定数量的无重复的随机文档:

first get all ids get size of documents loop geting random index and skip duplicated number_of_docs=7 db.collection('preguntas').find({},{_id:1}).toArray(function(err, arr) { count=arr.length idsram=[] rans=[] while(number_of_docs!=0){ var R = Math.floor(Math.random() * count); if (rans.indexOf(R) > -1) { continue } else { ans.push(R) idsram.push(arr[R]._id) number_of_docs-- } } db.collection('preguntas').find({}).toArray(function(err1, doc1) { if (err1) { console.log(err1); return; } res.send(doc1) }); });

我对php的解决方案:

/**
 * Get random docs from Mongo
 * @param $collection
 * @param $where
 * @param $fields
 * @param $limit
 * @author happy-code
 * @url happy-code.com
 */
private function _mongodb_get_random (MongoCollection $collection, $where = array(), $fields = array(), $limit = false) {

    // Total docs
    $count = $collection->find($where, $fields)->count();

    if (!$limit) {
        // Get all docs
        $limit = $count;
    }

    $data = array();
    for( $i = 0; $i < $limit; $i++ ) {

        // Skip documents
        $skip = rand(0, ($count-1) );
        if ($skip !== 0) {
            $doc = $collection->find($where, $fields)->skip($skip)->limit(1)->getNext();
        } else {
            $doc = $collection->find($where, $fields)->limit(1)->getNext();
        }

        if (is_array($doc)) {
            // Catch document
            $data[ $doc['_id']->{'$id'} ] = $doc;
            // Ignore current document when making the next iteration
            $where['_id']['$nin'][] = $doc['_id'];
        }

        // Every iteration catch document and decrease in the total number of document
        $count--;

    }

    return $data;
}

在Mongoose中最好的方法是使用$sample进行聚合调用。 然而,Mongoose并不会将Mongoose文档应用到Aggregation上——尤其是当populate()也被应用的时候。

从数据库中获取一个“精益”数组:

/*
Sample model should be init first
const Sample = mongoose …
*/

const samples = await Sample.aggregate([
  { $match: {} },
  { $sample: { size: 33 } },
]).exec();
console.log(samples); //a lean Array

获取mongoose文档数组:

const samples = (
  await Sample.aggregate([
    { $match: {} },
    { $sample: { size: 27 } },
    { $project: { _id: 1 } },
  ]).exec()
).map(v => v._id);

const mongooseSamples = await Sample.find({ _id: { $in: samples } });

console.log(mongooseSamples); //an Array of mongoose documents

下面是一种使用_id的默认ObjectId值和一些数学和逻辑的方法。

// Get the "min" and "max" timestamp values from the _id in the collection and the 
// diff between.
// 4-bytes from a hex string is 8 characters

var min = parseInt(db.collection.find()
        .sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    max = parseInt(db.collection.find()
        .sort({ "_id": -1 })limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    diff = max - min;

// Get a random value from diff and divide/multiply be 1000 for The "_id" precision:
var random = Math.floor(Math.floor(Math.random(diff)*diff)/1000)*1000;

// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")

// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
   .sort({ "_id": 1 }).limit(1).toArray()[0];

这是shell表示法的一般逻辑,很容易适应。

所以在点上:

查找集合中的最小和最大主键值 生成一个位于这些文档的时间戳之间的随机数。 将随机数与最小值相加,然后找到大于或等于该值的第一个文档。

这使用了从“十六进制”的时间戳值中“填充”来形成有效的ObjectId值,因为这就是我们正在寻找的。使用整数作为_id值本质上更简单,但在点中基本思想相同。