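Why do people say there is modulo bias when using a random number generator?

I have seen this question asked a lot but never seen a truly concrete answer to it. So I am going to post one here, which will hopefully help people understand why exactly there is "modulo bias" when using a random number generator, like rand() in C++.

Current answer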

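With a RAND_MAX value of 3 (in reality it should be much higher than that, but the bias would still exist), these calculations show that there is a bias:

1 % 2 = 1
2 % 2 = 0
3 % 2 = 1
random_between(1, 3) % 2 = more likely a 1

In this case, % 2 is what you shouldn't do when you want a random number between 0 and 1. You could get a random number between 0 and 2 by doing % 3 though, because in this case the generator has exactly 3 possible outputs (1, 2 and 3), and that count is a multiple of 3.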

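Another method

There are much simpler ways, but to add to the other answers, here is my solution to get a random number between 0 and n - 1, so n different possibilities, without bias.

The number of bits (not bytes) needed to encode the number of possibilities is the number of bits of random data you'll need.

Encode the number from random bits.

If this number is >= n, restart (no modulo).

Really random data is not easy to obtain, so why use more bits than needed?

Below is an example in Smalltalk, using a cache of bits from a pseudo-random number generator. I'm no security expert, so use at your own risk.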

next: n

    | bitSize r from to |
    n < 0 ifTrue: [^0 - (self next: 0 - n)].
    n = 0 ifTrue: [^nil].
    n = 1 ifTrue: [^0].
    cache isNil ifTrue: [cache := OrderedCollection new].
    cache size < (self randmax highBit) ifTrue: [
        Security.DSSRandom default next asByteArray do: [ :byte |
            (1 to: 8) do: [ :i |    cache add: (byte bitAt: i)]
        ]
    ].
    r := 0.
    bitSize := n highBit.
    to := cache size.
    from := to - bitSize + 1.
    (from to: to) do: [ :i |
        r := r bitAt: i - from + 1 put: (cache at: i)
    ].
    cache removeFrom: from to: to.
    r >= n ifTrue: [^self next: n].
    ^r
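
The three steps above can also be sketched in C++. This is only an illustration, not a translation of the Smalltalk method: uniform_from_bits and its next_bit parameter are hypothetical names, the caller is assumed to supply unbiased, independent bits, and the bit count is taken as the width of n - 1 rather than n highBit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the answer): draws a value in [0, n) from a
// caller-supplied source of random bits. next_bit() must return 0 or 1.
template <class BitSource>
std::uint32_t uniform_from_bits(std::uint32_t n, BitSource next_bit) {
    assert(n > 0);
    // Number of bits needed to encode the values 0 .. n-1.
    int bits = 0;
    while ((std::uint64_t{1} << bits) < n) ++bits;

    for (;;) {
        std::uint32_t r = 0;
        for (int i = 0; i < bits; ++i) {
            r = (r << 1) | (next_bit() & 1u);
        }
        if (r < n) return r;  // accept; otherwise restart with fresh bits, no modulo
    }
}
```

With n = 5 each attempt uses 3 bits and succeeds with probability 5/8; the values 5, 6 and 7 trigger a restart.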

2016-08-11 10:06:46

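Other answers

I just wrote some code for Von Neumann's Unbiased Coin Flip Method, which should theoretically eliminate any bias in the random number generation process. More info can be found at (http://en.wikipedia.org/wiki/Fair_coin)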

#include <stdlib.h>  // rand

int unbiased_random_bit() {
    int x1, x2, prev;
    prev = 2;
    x1 = rand() % 2;
    x2 = rand() % 2;

    for (;; x1 = rand() % 2, x2 = rand() % 2)
    {
        if (x1 ^ x2)      // 01 -> 1, or 10 -> 0.
        {
            return x2;        
        }
        else if (x1 & x2)
        {
            if (!prev)    // 0011
                return 1;
            else
                prev = 1; // 1111 -> continue, bias unresolved
        }
        else
        {
            if (prev == 1)// 1100
                return 0;
            else          // 0000 -> continue, bias unresolved
                prev = 0;
        }
    }
}
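
For comparison, here is a minimal C++ sketch of the textbook von Neumann procedure, which only ever decides on an unequal pair and throws equal pairs away. von_neumann_bit and next_bit are hypothetical names. The argument for unbiasedness assumes the input bits are independent with a fixed bias, which rand() % 2 may or may not satisfy. The prev bookkeeping in the code above, which reuses discarded pairs, is not part of this textbook version.

```cpp
#include <cassert>

// Classic von Neumann procedure: read bits in pairs, discard equal pairs,
// and decide on the first unequal pair. next_bit() is a hypothetical source
// of independent bits that may be biased but must have a fixed bias.
template <class BitSource>
int von_neumann_bit(BitSource next_bit) {
    for (;;) {
        int x1 = next_bit();
        int x2 = next_bit();
        assert((x1 == 0 || x1 == 1) && (x2 == 0 || x2 == 1));
        if (x1 != x2) return x2;  // 01 -> 1, 10 -> 0, same convention as above
        // 00 or 11: discard the pair and draw again
    }
}
```

If a single bit is 1 with probability p, the pairs 01 and 10 each occur with probability p(1 - p), which is why the two results are equally likely.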

2014-04-09 17:31:18

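So rand() is a pseudo-random number generator which chooses a natural number between 0 and RAND_MAX, a constant defined in cstdlib (see this article for a general overview of rand()).

Now what happens if you want to generate a random number between, say, 0 and 2? For the sake of explanation, let's say RAND_MAX is 10 and I decide to generate a random number between 0 and 2 by calling rand()%3. However, rand()%3 does not produce the numbers between 0 and 2 with equal probability!

When rand() returns 0, 3, 6, or 9, rand()%3 == 0. Therefore, P(0) = 4/11

When rand() returns 1, 4, 7, or 10, rand()%3 == 1. Therefore, P(1) = 4/11

When rand() returns 2, 5, or 8, rand()%3 == 2. Therefore, P(2) = 3/11

This does not generate the numbers between 0 and 2 with equal probability. Of course for small ranges this might not be the biggest issue, but for a larger range this could skew the distribution, biasing the smaller numbers.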
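
The counts above can be checked by brute force. The following C++ sketch tallies residues for every possible generator output; residue_counts and max_value are illustrative names, and it assumes each value in 0..max_value is equally likely.

```cpp
#include <cassert>
#include <vector>

// Counts how many of the values 0 .. max_value land on each residue mod n.
// max_value plays the role of RAND_MAX; both names are illustrative.
std::vector<int> residue_counts(int max_value, int n) {
    assert(max_value >= 0 && n > 0);
    std::vector<int> counts(n, 0);
    for (int v = 0; v <= max_value; ++v) {
        ++counts[v % n];
    }
    return counts;
}
```

residue_counts(10, 3) yields the counts 4, 4, 3, matching the probabilities 4/11, 4/11 and 3/11.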

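So when does rand()%n return a range of numbers from 0 to n-1 with equal probability? When RAND_MAX%n == n - 1. In this case, along with our earlier assumption that rand() does return a number between 0 and RAND_MAX with equal probability, the modulo classes of n would also be equally distributed.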
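
Put differently, the condition says that the number of possible outputs, RAND_MAX + 1, is a multiple of n. A small C++ sketch of that check, with modulo_is_uniform as a hypothetical helper name and max_value standing in for RAND_MAX:

```cpp
#include <cassert>

// True when x % n is uniform on [0, n) for x uniform on [0, max_value].
// The two conditions are equivalent: max_value % n == n - 1 exactly when
// the number of possible values, max_value + 1, is a multiple of n.
bool modulo_is_uniform(long long max_value, long long n) {
    assert(max_value >= 0 && n > 0);
    bool by_remainder = (max_value % n == n - 1);
    bool by_count = ((max_value + 1) % n == 0);
    assert(by_remainder == by_count);
    return by_remainder;
}
```

For the toy value 10 and n = 3 it returns false; with 11 in place of 10 it returns true.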

So how do we solve this problem? A crude way is to keep generating random numbers until you get a number in your desired range:

int x; 
do {
    x = rand();
} while (x >= n);

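but that's inefficient for low values of n, since you only have roughly an n/RAND_MAX chance of getting a value in your range, and so you'll need to perform about RAND_MAX/n calls to rand() on average.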
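
To put a number on that cost, the sketch below computes the expected call count for the crude loop. It assumes rand() is uniform on [0, max_value], so the exact success probability is n / (max_value + 1); expected_calls_naive is a hypothetical name.

```cpp
#include <cassert>

// Expected number of rand() calls for the crude loop, assuming rand() is
// uniform on [0, max_value]. Each call succeeds with probability
// n / (max_value + 1), so the mean of the geometric distribution is its inverse.
double expected_calls_naive(long long max_value, long long n) {
    assert(n > 0 && n <= max_value + 1);
    return static_cast<double>(max_value + 1) / static_cast<double>(n);
}
```

With max_value = 32767 (the smallest RAND_MAX the C standard allows) and n = 2, that is 16384 calls on average for a single coin flip.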

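A more efficient formula approach would be to take some large range with a length divisible by n, like RAND_MAX - RAND_MAX % n, keep generating random numbers until you get one that lies in the range, and then take the modulus: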

int x;

do {
    x = rand();
} while (x >= (RAND_MAX - RAND_MAX % n));

x %= n;

For small values of n, this will rarely require more than one call to rand().
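
Wrapped up as a function, the approach might look like the C++ sketch below. uniform_below and its next parameter are hypothetical: next stands in for rand() so the function can be tried with any source that is uniform on [0, max_value].

```cpp
#include <cassert>

// Sketch of the rejection loop as a reusable function. next() is a
// hypothetical stand-in for rand() and must return integers uniform on
// [0, max_value].
template <class Generator>
int uniform_below(int n, int max_value, Generator next) {
    assert(n > 0 && n <= max_value);
    const int limit = max_value - max_value % n;  // a multiple of n
    int x;
    do {
        x = next();
    } while (x >= limit);  // reject the incomplete block at the top
    return x % n;
}
```

With the standard library it could be called as uniform_below(6, RAND_MAX, [] { return std::rand(); }) after including <cstdlib>.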

Works cited and further reading:

CPlusPlus Reference

Eternally Confuzzled

2012-06-11 17:44:04

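Repeatedly drawing a new random number is a good way to remove the bias.

Update

We could make the code fast if we search for an x in a range divisible by n.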

// Assumptions
// rand() in [0, RAND_MAX]
// n in (0, RAND_MAX]

int x; 

// Keep searching for an x in a range divisible by n 
do {
    x = rand();
} while (x >= RAND_MAX - (RAND_MAX % n));

x %= n;

The above loop should be very fast, typically about 1 iteration on average when n is small relative to RAND_MAX.
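
How fast the loop is can be worked out directly. A sketch, assuming rand() is uniform on [0, max_value] (expected_iterations is a hypothetical name):

```cpp
#include <cassert>

// Expected iterations of the rejection loop, assuming rand() is uniform on
// [0, max_value]. Values below limit are accepted, so each iteration
// succeeds with probability limit / (max_value + 1).
double expected_iterations(long long max_value, long long n) {
    assert(n > 0 && n <= max_value);
    const long long limit = max_value - max_value % n;
    return static_cast<double>(max_value + 1) / static_cast<double>(limit);
}
```

For max_value = 32767 and n = 3 this gives about 1.00006. The worst case over n in (0, RAND_MAX] is 2, reached for example at n = 16384, because the accepted range always covers at least half of the possible outputs.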

2012-06-12 00:10:56

Modulo reduction is a commonly seen way to make a random integer generator avoid the worst case of running forever.

When the range of possible integers is unknown, however, there is no way in general to "fix" this worst case of running forever without introducing bias. It's not just modulo reduction (rand() % n, discussed in the accepted answer) that will introduce bias this way, but also the "multiply-and-shift" reduction of Daniel Lemire, or if you stop rejecting an outcome after a set number of iterations. (To be clear, this doesn't mean there is no way to fix the bias issues present in pseudorandom generators. For example, even though modulo and other reductions are biased in general, they will have no issues with bias if the range of possible integers is a power of 2 and if the random generator produces unbiased random bits or blocks of them.)
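
The bias of multiply-and-shift without rejection can be seen by enumeration, just as with modulo. The C++ sketch below is an illustration under that assumption, not Lemire's full method; multiply_shift_counts is a hypothetical name and the input is assumed uniform on [0, 2^bits).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// For every x in [0, 2^bits), computes (x * n) >> bits and counts how often
// each result in [0, n) occurs. This is the multiply-and-shift reduction
// without any rejection step.
std::vector<int> multiply_shift_counts(int bits, std::uint32_t n) {
    assert(bits > 0 && bits <= 16 && n > 0);
    std::vector<int> counts(n, 0);
    const std::uint64_t range = std::uint64_t{1} << bits;
    for (std::uint64_t x = 0; x < range; ++x) {
        ++counts[(x * n) >> bits];
    }
    return counts;
}
```

For bits = 3 and n = 3 the counts are 3, 3, 2, so the result 2 is less likely; for n = 4, a power of 2, every result occurs exactly twice.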

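The rest of this answer will show the relationship between running time and bias in random generators. From here on, we will assume we have a "true" random generator that can produce unbiased and independent random bits.*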

In 1976, D. E. Knuth and A. C. Yao showed that any algorithm that produces random integers with a given probability, using only random bits, can be represented as a binary tree, where random bits indicate which way to traverse the tree and each leaf (endpoint) corresponds to an outcome. In this case, we're dealing with algorithms that generate random integers in [0, n), where each integer is chosen with probability 1/n. The algorithm is unbiased if the same number of leaves appear in the tree for all outcomes. But if 1/n has a non-terminating binary expansion (which will be the case if n is not a power of 2), the algorithm will be unbiased only if—

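the binary tree has an "infinite" depth, or

the binary tree includes "rejection" leaves at the end,

and in either case, the algorithm won't run in constant time and will run forever in the worst case. (On the other hand, when n is a power of 2, the optimal binary tree will have a finite depth and no rejection nodes.)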

The binary tree concept also shows that any way to "fix" this worst-case time complexity will lead to bias in general. (Again, this doesn't mean there is no way to fix the bias issues present in pseudorandom generators.) For instance, modulo reductions are equivalent to a binary tree in which rejection leaves are replaced with labeled outcomes — but since there are more possible outcomes than rejection leaves, only some of the outcomes can take the place of the rejection leaves, introducing bias. The same kind of binary tree — and the same kind of bias — results if you stop rejecting after a set number of iterations. (However, this bias may be negligible depending on the application. There are also security aspects to random integer generation, which are too complicated to discuss in this answer.)
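
As an illustration of the capped-rejection case, the C++ sketch below counts exactly how often each outcome occurs when drawing from [0, 3) with 2 bits per try and giving up after a fixed number of tries. The function name, the choice n = 3 and the fallback value 0 are all assumptions made for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Exact outcome counts, out of 4^max_tries equally likely bit strings, for
// drawing a number in [0, 3) from 2 bits per try when the loop gives up
// after max_tries tries and returns 0. The fallback value 0 is an
// arbitrary choice made for this illustration.
std::array<std::uint64_t, 3> capped_rejection_counts(int max_tries) {
    assert(max_tries >= 1 && max_tries <= 30);
    std::array<std::uint64_t, 3> counts{0, 0, 0};
    std::uint64_t leaves_per_accept = std::uint64_t{1} << (2 * (max_tries - 1));
    for (int t = 0; t < max_tries; ++t) {
        // Try t is reached only after t rejections; each accepted value then
        // covers 4^(max_tries - 1 - t) of the full-depth bit strings.
        for (auto& c : counts) c += leaves_per_accept;
        leaves_per_accept >>= 2;
    }
    counts[0] += 1;  // the single all-rejected string falls back to 0
    return counts;
}
```

One try gives counts 2, 1, 1 out of 4; three tries give 22, 21, 21 out of 64. The bias shrinks with every extra try but never reaches zero, since a power of 4 is never divisible by 3.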

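To illustrate, the following JavaScript code implements a random integer algorithm called the Fast Dice Roller by J. Lumbroso (2013). Note that it includes a rejection event and a loop, which are necessary to make the algorithm unbiased in the general case.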

function randomInt(minInclusive, maxExclusive) {
 var maxInclusive = (maxExclusive - minInclusive) - 1
 var x = 1
 var y = 0
 while(true) {
    x = x * 2
    var randomBit = (Math.random() < 0.5 ? 0 : 1)
    y = y * 2 + randomBit
    if(x > maxInclusive) {
      if (y <= maxInclusive) { return y + minInclusive }
      // Rejection
      x = x - maxInclusive - 1
      y = y - maxInclusive - 1
    }
 }
}

Note

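* This answer won't involve the rand() function in C because it has many issues. Perhaps the most serious here is the fact that the C standard does not explicitly specify a particular distribution for the numbers returned by rand(), not even a uniform distribution.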

2020-07-14 20:09:56
