我的面试问题是这样的:

给定一个包含40亿个整数的输入文件,提供一种算法来生成一个文件中不包含的整数。假设您有1gb内存。如果你只有10mb的内存,你会怎么做。

我的分析:

文件大小为4×109×4 bytes = 16gb。

我们可以进行外部排序,从而知道整数的范围。

我的问题是,在已排序的大整数集中检测缺失整数的最佳方法是什么?

我的理解(看完所有答案后):

假设我们讨论的是32位整数,有232 = 4*109个不同的整数。

情况1:我们有1gb = 1 * 109 * 8位= 80亿位内存。

解决方案:

如果我们用一位表示一个不同的整数,这就足够了。我们不需要排序。

实现:

int radix = 8;
byte[] bitfield = new byte[0xffffffff/radix];
void F() throws FileNotFoundException{
    Scanner in = new Scanner(new FileReader("a.txt"));
    while(in.hasNextInt()){
        int n = in.nextInt();
        bitfield[n/radix] |= (1 << (n%radix));
    }

    for(int i = 0; i< bitfield.lenght; i++){
        for(int j =0; j<radix; j++){
            if( (bitfield[i] & (1<<j)) == 0) System.out.print(i*radix+j);
        }
    }
}

情形二:10mb内存= 10 * 106 * 8bits = 8000万bits

Solution: For all possible 16-bit prefixes, there are 216 number of integers = 65536, we need 216 * 4 * 8 = 2 million bits. We need build 65536 buckets. For each bucket, we need 4 bytes holding all possibilities because the worst case is all the 4 billion integers belong to the same bucket. Build the counter of each bucket through the first pass through the file. Scan the buckets, find the first one who has less than 65536 hit. Build new buckets whose high 16-bit prefixes are we found in step2 through second pass of the file Scan the buckets built in step3, find the first bucket which doesnt have a hit. The code is very similar to above one.

结论: 我们通过增加文件传递来减少内存。


为那些迟到的人澄清一下:这个问题,正如所问的,并不是说有一个整数没有包含在文件中——至少大多数人不是这么理解的。不过,评论线程中的许多评论都是关于任务的变化。不幸的是,把它引入评论线程的评论后来被它的作者删除了,所以现在看起来它的孤儿回复只是误解了一切。这很让人困惑,抱歉。


当前回答

If they are 32-bit integers (likely from the choice of ~4 billion numbers close to 232), your list of 4 billion numbers will take up at most 93% of the possible integers (4 * 109 / (232) ). So if you create a bit-array of 232 bits with each bit initialized to zero (which will take up 229 bytes ~ 500 MB of RAM; remember a byte = 23 bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1; and then read through your bit-array and return the first bit that's still 0.

In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints; and only record #s that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed; with overwhelming probability (it differs by 100% by about 10-2592069) you will find a missing int). In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.

但我们假设不是有40亿个数字;你有232 - 1这样的数字和不到10mb的RAM;所以任何小范围的整数都只有很小的可能性不包含这个数字。

如果保证列表中的每个int都是唯一的,那么可以将这些数字相加,并减去一个#,再减去完整的和(½)(232)(232 - 1)= 9223372034707292160,以找到缺少的int。但是,如果出现了两次int,则此方法将失败。

However, you can always divide and conquer. A naive method, would be to read through the array and count the number of numbers that are in the first half (0 to 231-1) and second half (231, 232). Then pick the range with fewer numbers and repeat dividing that range in half. (Say if there were two less number in (231, 232) then your next search would count the numbers in the range (231, 3*230-1), (3*230, 232). Keep repeating until you find a range with zero numbers and you have your answer. Should take O(lg N) ~ 32 reads through the array.

这种方法效率很低。我们在每一步中只使用两个整数(或者大约8字节的RAM和一个4字节(32位)整数)。更好的方法是将其划分为sqrt(232) = 216 = 65536个箱子,每个箱子中有65536个数字。每个bin需要4个字节来存储它的计数,因此需要218字节= 256 kB。因此,bin 0为(0 ~ 65535=216-1),bin 1为(216=65536 ~ 2*216-1=131071),bin 2为(2*216=131072 ~ 3*216-1=196607)。在python中,你会有这样的代码:

import numpy as np
nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
    nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
    if bin_count < 65536:
        break # we have found an incomplete bin with missing ints (bin_num)

通读~ 40亿整数列表;然后计算216个容器中每个容器中有多少int,并找到一个不包含65536个数字的incomplete_bin。然后你再读一遍40亿的整数列表;但这次只注意整数在这个范围内;当你找到他们的时候,你会有点抓狂。

del nums_in_bin # allow gc to free old 256kB array
from bitarray import bitarray
my_bit_array = bitarray(65536) # 32 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
    if N // 65536 == bin_num:
        my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
    if not bit:
        print bin_num*65536 + i
        break

其他回答

为了完整起见,这里有另一个非常简单的解决方案,它很可能需要很长时间才能运行,但只使用很少的内存。

设所有可能的整数为从int_min到int_max的范围,和 bool isNotInFile(integer)一个函数,如果文件不包含某个整数,则返回true,否则返回false(通过将该特定整数与文件中的每个整数进行比较)

for (integer i = int_min; i <= int_max; ++i)
{
    if (isNotInFile(i)) {
        return i;
    }
}

正如Ryan所说,基本上,对文件进行排序,然后遍历整数,当一个值被跳过时,你就有了:)

EDIT at downvotes: OP提到文件可以排序,所以这是一个有效的方法。

出于某种原因,当我读到这个问题时,我想到了对角化。假设是任意大的整数。

Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion bit number one bit at a time, and that number will not be the same as any in the file. Proof: it were the same as the nth number, then they would agree on the nth bit, but they don't by construction.

给定一个包含40亿个整数的输入文件,提供一个算法 生成文件中不包含的整数。假设你 有1gib的内存。接着问如果只有你会怎么做 10内存MiB。 文件大小为4 * 109 * 4字节= 16gib

如果是32位无符号整数

0 <= Number < 2^32
0 <= Number < 4,294,967,296

我建议的解决方案是:c++不进行错误检查

#include <vector>
#include <fstream>
#include <iostream>
using namespace std;

int main ()
{
    const long SIZE = 1L << 32;

    std::vector<bool> checker(SIZE, false);

    std::ifstream infile("file.txt");  // TODO: error checking

    unsigned int num = 0;

    while (infile >> num)
    {
        checker[num] = true ;
    }

    infile.close();

    // print missing numbers

    for (long i = 0; i < SIZE; i++)
    {
        if (!checker[i])
            cout << i << endl ;
    }

    return 0;
}

复杂性

Space ~ 232 bits = 229 Bytes = 219 KB = 29 MB = 1/2 GB 时间~单次通过 完整性~是

根据原题中目前的措辞,最简单的解决方法是:

找到文件中的最大值,然后加上1。