Generate an integer that is not among four billion given ones

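I was asked this interview question:

Given an input file with four billion integers, provide an algorithm to generate an integer that is not contained in the file. Assume you have 1 GB of memory. Follow up with what you would do if you had only 10 MB of memory.

My analysis:

The size of the file is 4×10^9×4 bytes = 16 GB.

We can do an external sort, which lets us know the range of the integers.

My question is: what is the best way to detect a missing integer in the sorted set of big integers?

My understanding (after reading all the answers):

Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4.3×10^9 distinct integers.

Case 1: we have 1 GB = 1 * 10^9 * 8 bits = 8 billion bits of memory.

Solution:

If we use one bit to represent each distinct integer, that is enough (2^32 bits is about 0.5 GB). We don't need to sort.

Implementation: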

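import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

// One bit per possible 32-bit value: 2^32 bits = 2^26 longs = 512 MB.
// Needs a heap of roughly 1 GB (for example -Xmx1g).
long[] bitfield = new long[1 << 26];

void F() throws FileNotFoundException {
    try (Scanner in = new Scanner(new FileReader("a.txt"))) {
        while (in.hasNextInt()) {
            // Treat n as unsigned so that negative values get a valid index.
            long n = in.nextInt() & 0xFFFFFFFFL;
            bitfield[(int) (n >>> 6)] |= 1L << (n & 63);
        }
    }
    for (int i = 0; i < bitfield.length; i++) {
        if (bitfield[i] == -1L) continue; // all 64 values of this word are present
        for (int j = 0; j < 64; j++) {
            if ((bitfield[i] & (1L << j)) == 0) {
                // Convert the unsigned index back to a signed int and stop.
                System.out.println((int) (((long) i << 6) | j));
                return;
            }
        }
    }
}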

Case 2: 10 MB memory = 10 * 10^6 * 8 bits = 80 million bits

Solution:

There are 2^16 = 65536 possible 16-bit prefixes, so we need 2^16 * 4 * 8 ≈ 2 million bits. We need to build 65536 buckets. Each bucket needs a 4-byte counter, because in the worst case all 4 billion integers fall into the same bucket.

1. Build the counter of each bucket during the first pass through the file.
2. Scan the buckets and find the first one that has fewer than 65536 hits.
3. Build new buckets for the values whose high 16 bits match the prefix found in step 2, during a second pass through the file.
4. Scan the buckets built in step 3 and find the first one that has no hit.

The code is very similar to the one above.
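The four steps above can be sketched roughly as follows. This is only an illustration under some assumptions: `PrefixBuckets` and `source` are made-up names, `source` stands in for "open the file and read it again", and the counters are `long` rather than 4-byte values so that signed overflow is not a concern in Java.

```java
import java.util.BitSet;
import java.util.function.Supplier;
import java.util.stream.IntStream;

final class PrefixBuckets {
    // source is a hypothetical stand-in for re-reading the file:
    // each call to get() must return a fresh stream over the same integers.
    static int findMissing(Supplier<IntStream> source) {
        // Pass 1: count values per 16-bit prefix (65536 longs = 512 KB).
        long[] counts = new long[1 << 16];
        source.get().forEach(n -> counts[n >>> 16]++);

        // A bucket with fewer than 65536 hits cannot contain every value.
        int prefix = -1;
        for (int p = 0; p < counts.length; p++) {
            if (counts[p] < (1 << 16)) {
                prefix = p;
                break;
            }
        }
        if (prefix < 0) {
            throw new IllegalStateException("every bucket has at least 65536 hits");
        }

        // Pass 2: mark which low 16 bits occur under that prefix (65536 bits = 8 KB).
        final int chosen = prefix;
        BitSet seen = new BitSet(1 << 16);
        source.get()
              .filter(n -> (n >>> 16) == chosen)
              .forEach(n -> seen.set(n & 0xFFFF));

        // The first clear bit is a low half that never appeared with this prefix.
        return (chosen << 16) | seen.nextClearBit(0);
    }
}
```

With fewer than 2^32 numbers in the input, at least one bucket must have fewer than 65536 hits, so the exception should only be reachable for larger inputs.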

Conclusion: we reduce memory use by making more passes over the file.

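A clarification for those arriving late: the question, as asked, does not say that exactly one integer is missing from the file; at least, that is not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately, the comment that introduced it to the thread was later deleted by its author, so the orphaned replies now look as if they simply misunderstood everything. It is very confusing, sorry.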

Current answer

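This problem is discussed in detail in Jon Bentley's "Column 1. Cracking the Oyster", Programming Pearls, Addison-Wesley, pp. 3-10.

Bentley discusses several approaches, including external sort and merge sort using several external files, but the best method he suggests is a single-pass algorithm using bit fields, which he humorously calls "Wonder Sort" :) Coming to the problem, 4 billion numbers can be represented in: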

4 billion bits = (4000000000 / 8) bytes = about 0.466 GB

The code to implement the bitset is simple (taken from the solutions page):

#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];

void set(int i) {        a[i>>SHIFT] |=  (1<<(i & MASK)); }
void clr(int i) {        a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int  test(int i){ return a[i>>SHIFT] &   (1<<(i & MASK)); }

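Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array, and then examines the array with the test function above to find the missing number.

If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on the available memory. To take a very simple example, if only 1 byte (i.e. memory to handle 8 numbers) were available and the range were 0 to 31, we would divide it into the ranges 0-7, 8-15, 16-23 and 24-31, and handle one range in each of the 32/8 = 4 passes.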
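The k-pass idea could look something like the sketch below. It is not Bentley's code: `RangePasses`, `source`, `rangeSize` and `windowBits` are names invented here, and `source` is assumed to return a fresh stream over the same input on every call, which stands in for re-reading the file.

```java
import java.util.BitSet;
import java.util.function.Supplier;
import java.util.stream.LongStream;

final class RangePasses {
    // Returns a value in [0, rangeSize) that is absent from the input, or -1 if there is none.
    // windowBits is how many values fit in memory at once (8 in the one-byte example).
    static long findMissing(Supplier<LongStream> source, long rangeSize, int windowBits) {
        for (long lo = 0; lo < rangeSize; lo += windowBits) {
            final long start = lo;
            final long end = Math.min(lo + windowBits, rangeSize);

            // One pass over the input per window, keeping only values inside [start, end).
            BitSet seen = new BitSet(windowBits);
            source.get()
                  .filter(v -> v >= start && v < end)
                  .forEach(v -> seen.set((int) (v - start)));

            // A clear bit inside the window is a missing value.
            int gap = seen.nextClearBit(0);
            if (start + gap < end) {
                return start + gap;
            }
        }
        return -1;
    }
}
```

For the example in the text, `rangeSize` would be 32 and `windowBits` 8, giving at most 4 passes; the search stops at the first window that has a gap.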

HTH.

2011-08-23 04:20:53

Other answers

The simplest approach is to find the minimum number in the file and return 1 less than that. This uses O(1) storage and O(n) time for a file of n numbers. However, it will fail if the number range is limited, which could make min-1 not-a-number.

The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.

A 2-pass method with 2^16 counting buckets has also been mentioned. It reads 2*n integers, so it uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^32 numbers. However, it is easily extended to (e.g.) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case the run time is no longer O(n) but O(n*log n).

The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell, answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a number that is not in the file. Rationale: given ~4G distinct numbers XOR'd together, and ca. 300M not in the file, the number of set bits in each bit position has an equal chance of being odd or even. Thus all 2^32 numbers are equally likely to arise as the XOR result, and 93% of them are already in the file. Note that if the numbers in the file aren't all distinct, the XOR method's probability of success rises.
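As a small illustration of the first idea in this answer (minimum minus one) and its failure case, here is a sketch. The class and method names are invented for the example, and it assumes 32-bit signed `int` values, so the "limited range" failure shows up when the minimum is `Integer.MIN_VALUE`.

```java
import java.util.OptionalInt;
import java.util.stream.IntStream;

final class MinMinusOne {
    // Returns min - 1, or empty when the minimum is Integer.MIN_VALUE
    // and min - 1 therefore does not fit in an int.
    static OptionalInt belowMinimum(IntStream values) {
        OptionalInt min = values.min();
        if (!min.isPresent()) {
            return OptionalInt.of(0); // empty input: any integer is absent
        }
        if (min.getAsInt() == Integer.MIN_VALUE) {
            return OptionalInt.empty(); // limited range: this approach fails here
        }
        return OptionalInt.of(min.getAsInt() - 1);
    }
}
```

When the result is empty, one of the other methods in this thread (bitmap or counting buckets) would still be needed.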

2011-08-22 21:35:45

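Assuming that "integer" means 32 bits: 10 MB of space is more than enough to count, in a single pass through the input file, how many numbers in the file have any given 16-bit prefix, for all possible 16-bit prefixes. At least one of the buckets will have been hit fewer than 2^16 times. Do a second pass to find which of the possible numbers in that bucket are already used.

If it means more than 32 bits, but still of bounded size: do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.

If "integer" means a mathematical integer: read through the input once and keep track of the length of the longest number you have seen. When you are done, output a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, you can at least represent the length of anything that fits in it.)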
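For the "mathematical integer" case, a minimal sketch might look like this. It assumes the input arrives as decimal tokens with an optional sign, and it emits a run of 1s instead of random digits; `LongerThanAny` and `generate` are names made up for the illustration.

```java
final class LongerThanAny {
    // tokens: the decimal numbers of the input, one string per number (an assumption
    // about the file format). Only the longest digit count is remembered.
    static String generate(Iterable<String> tokens) {
        int maxDigits = 0;
        for (String t : tokens) {
            int digits = t.length();
            if (t.startsWith("-") || t.startsWith("+")) {
                digits--; // the sign is not a digit
            }
            maxDigits = Math.max(maxDigits, digits);
        }
        // A number with one more digit than the longest input cannot be in the file.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i <= maxDigits; i++) {
            out.append('1');
        }
        return out.toString();
    }
}
```

For a really large input the digits would have to be streamed out rather than collected in a `StringBuilder`, and the counter would need to be a `long`; the sketch keeps both simple.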

2011-08-22 21:28:00

You don't need to sort them, just repeatedly partition subsets of them.

The first step is like the first pass of a quicksort. Pick one of the integers, x, and using it make a pass through the array to put all the values less than x to its left and values more than x to its right. Find which side of x has the greatest number of available slots (integers not in the list). This is easily computable by comparing the value of x with its position. Then repeat the partition on the sub-list on that side of x. Then repeat the partition on the sub-sub list with the greatest number of available integers, etc. Total number of compares to get down to an empty range should be about 4 billion, give or take.
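A rough sketch of that repeated partitioning, under assumptions the answer leaves implicit: the values sit in an in-memory array, they are all distinct, they all lie in [lo, hi), and there are fewer than hi - lo of them. `PartitionSearch` is a made-up name, and the array is reordered in place.

```java
final class PartitionSearch {
    // Returns a value in [lo, hi) that does not occur in a.
    // Preconditions (assumed, not checked): distinct values, all in [lo, hi),
    // and a.length < hi - lo.
    static long findMissing(int[] a, long lo, long hi) {
        int from = 0, to = a.length; // current sub-list is a[from..to)
        while (from < to) {
            int x = a[from]; // pivot
            int i = from + 1;
            // Move every value smaller than x to the front of the sub-list.
            for (int j = from + 1; j < to; j++) {
                if (a[j] < x) {
                    int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                    i++;
                }
            }
            // Free slots on each side: size of the value range minus values present.
            long freeLeft = (x - lo) - (i - from - 1);
            long freeRight = (hi - 1 - x) - (to - i);
            if (freeLeft >= freeRight) {
                hi = x;          // continue in [lo, x)
                from = from + 1;
                to = i;
            } else {
                lo = x + 1L;     // continue in (x, hi)
                from = i;
            }
        }
        // The sub-list is empty but the range is not, so lo is unused.
        return lo;
    }
}
```

Each round keeps the side with more free slots, so the remaining range always contains at least one missing value.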

2011-08-25 05:52:51

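Since the problem does not specify that we have to find the smallest number that is not in the file, we could just generate a number that is longer than the input file itself. :)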

2011-08-23 11:07:04

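Provided there are no size constraints, the quickest way is to take the length of the file and generate that many plus one random digits (or just "11111..."s). Advantage: you don't even need to read the file, and you can reduce memory use to nearly zero. Disadvantage: you will print billions of digits.

However, if the only factor is minimizing memory usage and nothing else matters, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.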

2011-08-24 06:09:23
