


The file size is 4×10^9 × 4 bytes = 16 GB.




Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4×10^9 distinct integers.

Case 1: we have 1 GB = 1 × 10^9 × 8 bits = 8 billion bits of memory.




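import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

int radix = 8;
// One bit per 32-bit value: 2^32 / 8 = 2^29 bytes (512 MB).
byte[] bitfield = new byte[1 << 29];
void F() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        long n = in.nextInt() & 0xFFFFFFFFL; // map negative ints to unsigned indices
        bitfield[(int) (n / radix)] |= (1 << (int) (n % radix));
    }
    // Print every value whose bit is still clear, i.e. every integer not in the file.
    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < radix; j++) {
            // i * radix + j wraps around in int arithmetic, which restores the original signed value.
            if ((bitfield[i] & (1 << j)) == 0) System.out.println(i * radix + j);
        }
    }
}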

Case 2: 10 MB of memory = 10 × 10^6 × 8 bits = 80 million bits.

Solution: There are 2^16 = 65536 possible 16-bit prefixes, so we build 65536 buckets. Each bucket needs a 4-byte counter, because in the worst case all 4 billion integers fall into the same bucket; that is 2^16 * 4 * 8 = 2 million bits in total.

1. Build the counter of each bucket in a first pass through the file.
2. Scan the buckets and find the first one that has fewer than 65536 hits.
3. In a second pass through the file, build new buckets for the integers whose high 16-bit prefix is the one found in step 2.
4. Scan the buckets built in step 3 and find the first bucket that has no hit.

The code is very similar to the one above.
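The two passes described above could look roughly like the following sketch. It is not the original poster's code: the class name `TwoPassMissingInt`, the method `findMissing`, and the use of a `Supplier<IntStream>` to stand in for "one pass over the file" are assumptions made for illustration. It also swaps the second set of counters for one boolean per low 16-bit value, which is enough to detect "no hit", and it assumes the file holds fewer than 2^32 integers so that some bucket is guaranteed to have a gap.

```java
import java.util.function.Supplier;
import java.util.stream.IntStream;

final class TwoPassMissingInt {
    static final int BUCKETS = 1 << 16; // one bucket per 16-bit prefix

    /** Returns a 32-bit value that does not occur in the data; each passes.get() is one full pass. */
    static int findMissing(Supplier<IntStream> passes) {
        // Pass 1: count how many values share each high 16-bit prefix.
        int[] counts = new int[BUCKETS];
        passes.get().forEach(n -> counts[n >>> 16]++);

        // Find the first prefix with fewer than 65536 hits; it must miss at least one low half.
        // The comparison is unsigned because a counter may exceed Integer.MAX_VALUE.
        int prefix = -1;
        for (int p = 0; p < BUCKETS; p++) {
            if (Integer.compareUnsigned(counts[p], BUCKETS) < 0) {
                prefix = p;
                break;
            }
        }
        if (prefix < 0) {
            throw new IllegalStateException("every bucket is full; input has at least 2^32 values");
        }

        // Pass 2: record which low 16-bit halves occur under the chosen prefix.
        final int hi = prefix;
        boolean[] seen = new boolean[BUCKETS];
        passes.get()
              .filter(n -> (n >>> 16) == hi)
              .forEach(n -> seen[n & 0xFFFF] = true);

        // The first unseen low half, combined with the prefix, is absent from the data.
        for (int lo = 0; lo < BUCKETS; lo++) {
            if (!seen[lo]) {
                return (hi << 16) | lo;
            }
        }
        throw new IllegalStateException("unreachable if both passes saw the same data");
    }
}
```

The working set is a 256 KB counter array plus a 64 KB flag array, well inside the 10 MB budget. A call such as `TwoPassMissingInt.findMissing(() -> IntStream.of(0, 2))` would treat the two-element stream as the file.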

Conclusion: we reduce memory by increasing the number of passes over the file.



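2^(128*10^18) + 1 (which is (2^8)^(16*10^18) + 1): can't this be a universal answer for today? It represents a number that cannot be held in a 16 EB file, which is the maximum file size in any current file system.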





The lower the Shannon entropy, the higher this probability gets on average, but even in this worst case we have a 90% chance of finding a non-occurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator and store them in a list. Then read int after int and compare each one to all of your guesses. When there's a match, remove that list entry. After going through the whole file, chances are you will have more than one guess left. Use any of them. In the rare (10% even in the worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10 guesses raise the chance to about 99%).

Memory consumption: a few dozen bytes; complexity: O(n); overhead: negligible, since most of the time will be spent on the unavoidable hard disk accesses rather than on comparing ints. The actual worst case, when we do not assume a static distribution, is that every integer occurs at most once, because then only 1 - 4000000000/2³² ≈ 7% of all integers do not occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
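A minimal sketch of this guess-and-eliminate idea, under stated assumptions: the class `RandomGuessFinder`, the method `survivingGuesses`, and the `IntStream` standing in for the file are hypothetical names chosen for illustration, and a `HashSet` replaces the list so that each comparison is a constant-time lookup.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.stream.IntStream;

final class RandomGuessFinder {
    /**
     * Draws guessCount distinct random ints, removes every guess that shows up
     * in values, and returns whatever is left (possibly an empty set).
     */
    static Set<Integer> survivingGuesses(IntStream values, int guessCount, Random rng) {
        Set<Integer> guesses = new HashSet<>();
        while (guesses.size() < guessCount) {
            guesses.add(rng.nextInt());
        }
        // Single pass over the data: a guess that matches a value is eliminated.
        values.forEach(v -> guesses.remove(Integer.valueOf(v)));
        return guesses;
    }
}
```

Any element of the returned set is an integer that did not occur in the stream. If the set comes back empty, the caller would draw a fresh, larger batch of guesses and make another pass.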



If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers are about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide are roughly one in 10^31 (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
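The arithmetic behind these odds can be checked with a few lines of Java. This is only a sketch: the class `CollisionOdds` and the method `allCollide` are hypothetical names, and the calculation assumes the stored integers are all distinct and the candidates are independent and uniformly distributed.

```java
final class CollisionOdds {
    /**
     * Probability that all k independent, uniformly random 32-bit candidates
     * collide with one of n distinct stored values.
     */
    static double allCollide(double n, int k) {
        double single = n / Math.pow(2, 32); // odds of one collision
        return Math.pow(single, k);          // odds-of-one-collision ^ k
    }
}
```

With n = 4e9, `allCollide(4e9, 1)` is about 0.93 and `allCollide(4e9, 1000)` is on the order of 1e-31.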




Use a BitSet. 4 billion integers (assuming at most 2^32 distinct integers) packed into a BitSet at 8 per byte come to about 2^32 / 2^3 = 2^29 bytes = 0.5 GB.



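Looking at the BitSet API, it appears to support only 0..MAX_INT, so you may need 2 BitSets, one for +ve numbers and one for -ve numbers, but the memory requirements don't change.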
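A rough sketch of the two-BitSet layout follows. The wrapper class `SignedBitSet` and its methods are hypothetical names introduced here; only `java.util.BitSet` and its `set`, `get`, and `nextClearBit` methods come from the standard library. Negative values are stored at index `~n`, which maps -1..MIN_INT onto 0..MAX_INT.

```java
import java.util.BitSet;

final class SignedBitSet {
    private final BitSet nonNegative = new BitSet(); // bit n marks value n, for n >= 0
    private final BitSet negative = new BitSet();    // bit ~n marks value n, for n < 0

    void add(int n) {
        if (n >= 0) {
            nonNegative.set(n);
        } else {
            negative.set(~n);
        }
    }

    boolean contains(int n) {
        return n >= 0 ? nonNegative.get(n) : negative.get(~n);
    }

    /** Returns some int that was never added. */
    int findMissing() {
        // Assumption: a negative result means no clear bit exists within 0..MAX_INT,
        // so the search falls through to the other set.
        int i = nonNegative.nextClearBit(0);
        if (i >= 0) {
            return i;
        }
        int j = negative.nextClearBit(0);
        if (j >= 0) {
            return ~j;
        }
        throw new IllegalStateException("all 2^32 values are present");
    }
}
```

The BitSets here grow on demand, so small inputs use little memory; with the full 4 billion integers the two sets together would approach the 0.5 GB estimated above.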