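A long time ago, I bought a data structures book off the bargain table for $1.25. In it, the explanation of a hash function said that it should ultimately take its result mod a prime number because of "the nature of math".

What do you expect from a $1.25 book?

Anyway, I've had years to think about the nature of math, and I still can't figure it out.

Is the distribution of numbers truly more even when there is a prime number of buckets?

Or is this an old programmer's tale that everyone accepts because everybody else accepts it?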


Current answer

Just writing down some thoughts gathered from the answers.

Hashing uses a modulus so that any value fits into a given range.

We want to randomize collisions. That means there should be no pattern to how collisions happen; put another way, changing a small part of the input should produce a completely different hash value.

To randomize collisions, avoid using the base (10 in decimal, 16 in hex) as the modulus: 11 % 10 -> 1, 21 % 10 -> 1, 31 % 10 -> 1. This shows a clear pattern in the hash value distribution: values with the same last digit collide.

Avoid using powers of the base (10^2, 10^3, 10^n) as the modulus too, because that also creates a pattern: values with the same last n digits collide.

In fact, avoid any modulus that has factors other than itself and 1, because it creates a pattern: multiples of a factor are hashed into a restricted set of values. For example, 9 has 3 as a factor, so 3, 6, 9, ..., 999213 are always hashed into 0, 3, 6. 12 has 3 and 2 as factors, so 2n is always hashed into 0, 2, 4, 6, 8, 10, and 3n is always hashed into 0, 3, 6, 9.

This is a problem if the input is not evenly distributed: e.g. if many values are of the form 3n, then we only get 1/3 of all possible hash values and collisions are frequent.

So with a prime modulus, the only pattern is that multiples of the modulus always hash into 0; multiples of any other number are spread over all the hash values.
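The patterns described above are easy to check by brute force. Here is a minimal Python sketch; the helper name `buckets_hit` and the choice of 11 as the prime modulus are just assumptions for illustration.

```python
def buckets_hit(step, modulus, count=1000):
    """Sorted set of hash values produced by step, 2*step, ..., count*step."""
    return sorted({(step * i) % modulus for i in range(1, count + 1)})

print(buckets_hit(3, 9))   # [0, 3, 6]
print(buckets_hit(3, 12))  # [0, 3, 6, 9]
print(buckets_hit(2, 12))  # [0, 2, 4, 6, 8, 10]
print(buckets_hit(3, 11))  # every value from 0 to 10
```

With the composite moduli, multiples of a shared factor only ever reach a fraction of the hash values; with the prime modulus they reach all of them.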

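Other answers

I'd say the first answer at this link is the clearest answer I've found on this question.

Consider the set of keys K = {0, 1, ..., 100} and a hash table where the number of buckets is m = 12. Since 3 is a factor of 12, the keys that are multiples of 3 will be hashed to buckets that are multiples of 3: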

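Keys {0, 12, 24, 36, ...} will be hashed to bucket 0.
Keys {3, 15, 27, 39, ...} will be hashed to bucket 3.
Keys {6, 18, 30, 42, ...} will be hashed to bucket 6.
Keys {9, 21, 33, 45, ...} will be hashed to bucket 9.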

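If K is uniformly distributed (i.e., every key in K is equally likely to occur), then the choice of m is not so critical. But what happens if K is not uniformly distributed? Imagine that the keys most likely to occur are the multiples of 3. In this case, all of the buckets that are not multiples of 3 will very likely be empty (which is really bad in terms of hash table performance).

This situation is more common than it may seem. Imagine, for instance, that you are keeping track of objects based on where they are stored in memory. If your computer's word size is four bytes, then you will be hashing keys that are multiples of 4. Needless to say, choosing m to be a multiple of 4 would be a terrible choice: you would have 3m/4 buckets completely empty, and all of your keys colliding in the remaining m/4 buckets.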
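As a rough illustration of the memory-address case, the sketch below hashes 1000 made-up addresses aligned on 4-byte boundaries into m = 12 and m = 13 buckets. The addresses, the helper `bucket_loads`, and the two values of m are assumptions chosen for the example, not data from a real program.

```python
from collections import Counter

def bucket_loads(keys, m):
    """Count how many keys land in each bucket under key % m."""
    return Counter(key % m for key in keys)

# Hypothetical addresses: 1000 objects aligned on 4-byte boundaries.
addresses = range(0, 4000, 4)

for m in (12, 13):
    loads = bucket_loads(addresses, m)
    print(f"m={m}: {len(loads)} buckets used, fullest holds {max(loads.values())} keys")
```

With m = 12 only 3 of the 12 buckets (m/4) are ever used, while m = 13 uses all 13 and each bucket holds far fewer keys.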

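In general:

Every key in K that shares a common factor with the number of buckets m will be hashed to a bucket that is a multiple of this factor.

Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this be achieved? By choosing m to be a number that has very few factors: a prime number.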
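One way to make the general statement concrete: the multiples of a number k can only reach m / gcd(k, m) distinct buckets under key % m. The following sketch checks that count by brute force for m = 12 and m = 13; the helper name `reachable_buckets` is hypothetical.

```python
from math import gcd

def reachable_buckets(k, m):
    """Buckets hit by the keys 0, k, 2k, ... under key % m (brute force)."""
    return {(k * i) % m for i in range(m)}

for m in (12, 13):
    # The brute-force count matches m // gcd(k, m) for every step k.
    for k in range(1, m):
        assert len(reachable_buckets(k, m)) == m // gcd(k, m)
    worst = min(m // gcd(k, m) for k in range(1, m))
    print(f"m={m}: worst-case step reaches {worst} buckets")
```

For m = 12 the worst case (k = 6) reaches only 2 buckets; for the prime m = 13 every step below m reaches all 13.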

From Mario's answer.

http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

A pretty clear explanation, with pictures too.

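Edit: As a summary, primes are used because you have the best chance of obtaining a unique value when you multiply values by the chosen prime and add them all up. For example, given a string, multiplying each letter's value by the prime and then adding them all up gives you its hash value.

A better question would be: why exactly the number 31?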
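The multiply-and-add scheme from the summary can be sketched as follows. This is only one common way to arrange it (the running hash is multiplied by the prime before each letter is added); the function name `poly_hash` and the 2**32 modulus are assumptions, and 31 is used simply because it is the number mentioned above.

```python
def poly_hash(text, multiplier=31, modulus=2**32):
    """Multiply the running hash by a prime, then add the next character code."""
    h = 0
    for ch in text:
        h = (h * multiplier + ord(ch)) % modulus
    return h

print(poly_hash("ab"))  # 97*31 + 98 = 3105
print(poly_hash("ba"))  # 98*31 + 97 = 3135
```

A plain sum of character codes would give "ab" and "ba" the same value (195); multiplying by the prime makes the position of each letter matter.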

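For a hash function it's important not only to minimize collisions in general, but also to make it impossible to keep the same hash while changing a few bytes.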

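Say you have an equation: (x + y*z) % key = x with 0 < x < key and 0 < z < key. If key is a prime number, then y = n*key is a solution for every n in N, and no other y is.

An example where key is not prime: x = 1, z = 2, key = 8. Because key/z = 4 is still a natural number, 4 becomes a solution of our equation, and in this case y = n*(key/2) is a solution for every n in N. The number of solutions has practically doubled because 8 is not a prime.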

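If our attacker already knows that 8 is a possible solution of the equation, he can change the file from producing 8 to producing 4 and still get the same hash.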
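The example can be checked by listing every y that satisfies the equation. A small sketch, where the helper `solutions`, the search limit of 16, and the prime key 7 used for comparison are assumptions for illustration:

```python
def solutions(x, z, key, limit):
    """All y in 1..limit with (x + y*z) % key == x."""
    return [y for y in range(1, limit + 1) if (x + y * z) % key == x]

print(solutions(x=1, z=2, key=8, limit=16))  # [4, 8, 12, 16]
print(solutions(x=1, z=2, key=7, limit=16))  # [7, 14]
```

With key = 8, every multiple of 4 leaves the hash unchanged; with the prime key 7, only multiples of 7 itself do.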

Primes are used because they give you a good chance of obtaining a unique value for a typical hash function that uses polynomials modulo P. Say you use such a hash function for strings of length <= N, and you have a collision. That means that two different polynomials produce the same value modulo P. The difference of those polynomials is again a polynomial of degree N (or less). It has no more than N roots (this is where the nature of math shows itself, since this claim is only true for a polynomial over a field, which requires a prime modulus). So if N is much less than P, you are unlikely to have a collision. After that, experiment can probably show that 37 is big enough to avoid collisions for a hash table of strings of length 5-10, and small enough to use for calculations.
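The root-count claim can be seen on a tiny case. The sketch below counts the roots of x^2 - 1 modulo 7 and modulo 8; the polynomial, the two moduli, and the helper `roots_mod` are chosen purely for illustration.

```python
def roots_mod(coeffs, m):
    """Roots in 0..m-1 of a polynomial modulo m (coefficients highest degree first)."""
    def evaluate(x):
        value = 0
        for c in coeffs:
            value = (value * x + c) % m  # Horner's rule, reduced mod m
        return value
    return [x for x in range(m) if evaluate(x) == 0]

# x^2 - 1 has degree 2
print(roots_mod([1, 0, -1], 7))  # [1, 6]
print(roots_mod([1, 0, -1], 8))  # [1, 3, 5, 7]
```

Modulo the prime 7 the degree-2 polynomial has at most 2 roots, as expected over a field; modulo 8 it has 4, so the bound on collisions no longer holds.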

Primes are unique numbers. They are unique in that the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself, of course), because a prime is used to compose it. This property is used in hashing functions. Given a string "Samuel", you can generate a unique hash by multiplying each of the constituent digits or letters by a prime number and adding them up. This is why primes are used. However, using primes is an old technique. The key thing to understand is that as long as you can generate a sufficiently unique key, you can move to other hashing techniques too. For more on this topic, see http://www.azillionmonkeys.com/qed/hash.html

http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/