Why should hash functions use a prime number modulus?

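A long time ago, I bought a data structures book off the bargain table for $1.25. In it, the explanation of a hash function said that it should ultimately be modded by a prime number because of "the nature of math".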

What do you expect from a $1.25 book?

Anyway, I've had years to think about the nature of math, and I still can't figure it out.

Is the distribution of numbers truly more even when there is a prime number of buckets?

Or is this an old programmer's tale that everyone accepts because everybody else accepts it?

Current answer

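I'd like to add something to Steve Jessop's answer (I can't comment on it because I don't have enough reputation), since I found some helpful material. His answer is very helpful, but he made one mistake: the bucket size should not be a power of 2. I'll quote from "Introduction to Algorithms" by Thomas Cormen, Charles Leiserson, et al., page 263: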

When using the division method, we usually avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing the hash function to depend on all the bits of the key. As Exercise 11.3-3 asks you to show, choosing m = 2^p-1 when k is a character string interpreted in radix 2^p may be a poor choice, because permuting the characters of k does not change its hash value.
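Both claims in the quoted passage can be checked with a few lines of Python. This is only a minimal sketch: `division_hash` and `string_to_int` are hypothetical helper names, and the strings are assumed to be ASCII so that each character fits in one radix-2^8 digit.

```python
def division_hash(k: int, m: int) -> int:
    """Division-method hash: h(k) = k mod m."""
    return k % m

def string_to_int(s: str, p: int = 8) -> int:
    """Interpret s as a number in radix 2**p, one digit per character (hypothetical helper)."""
    radix = 1 << p
    value = 0
    for ch in s:
        value = value * radix + ord(ch)
    return value

# Case 1: m = 2**p, so only the p lowest-order bits of k matter.
p = 4
m_pow2 = 1 << p
k1 = 0b10100110
k2 = 0b00010110  # same low 4 bits as k1, different high bits
same_low_bits_collide = division_hash(k1, m_pow2) == division_hash(k2, m_pow2)

# Case 2: m = 2**p - 1 with strings read in radix 2**p.
# Because 2**p is congruent to 1 modulo m, the hash reduces to the sum of
# the character codes, so any permutation of the characters collides.
m_mersenne = (1 << 8) - 1  # 255
h_abc = division_hash(string_to_int("abc"), m_mersenne)
h_cab = division_hash(string_to_int("cab"), m_mersenne)

# For comparison, a prime modulus (251, chosen arbitrarily) separates these two strings.
h_abc_prime = division_hash(string_to_int("abc"), 251)
h_cab_prime = division_hash(string_to_int("cab"), 251)
```

Here `same_low_bits_collide` is true and `h_abc` equals `h_cab`, while the two hashes taken modulo 251 differ.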

Hope it helps.

2015-12-03 17:43:02

Other answers

I would say the first answer at this link is the clearest answer I have found on this question.

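Consider the set of keys K = {0, 1, ..., 100} and a hash table where the number of buckets is m = 12. Since 3 is a factor of 12, the keys that are multiples of 3 will be hashed to buckets that are multiples of 3: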

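Keys {0, 12, 24, 36, ...} will be hashed to bucket 0.
Keys {3, 15, 27, 39, ...} will be hashed to bucket 3.
Keys {6, 18, 30, 42, ...} will be hashed to bucket 6.
Keys {9, 21, 33, 45, ...} will be hashed to bucket 9.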

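If K is uniformly distributed (i.e., every key in K is equally likely to occur), then the choice of m is not so critical. But what happens if K is not uniformly distributed? Imagine that the keys most likely to occur are the multiples of 3. In this case, all of the buckets that are not multiples of 3 will very probably be empty (which is really bad for hash table performance).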

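This situation is more common than it may seem. Imagine, for instance, that you are keeping track of objects based on where they are stored in memory. If your computer's word size is four bytes, then you will be hashing keys that are multiples of 4. Needless to say, choosing m to be a multiple of 4 would be a terrible choice: you would have 3m/4 buckets completely empty, and all of your keys colliding in the remaining m/4 buckets.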
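The word-aligned address case is easy to reproduce. The sketch below uses made-up addresses (1000 consecutive multiples of 4 starting at 0x1000) and a hypothetical helper `buckets_used`; it is an illustration of the argument, not a benchmark.

```python
def buckets_used(keys, m):
    """Number of distinct buckets hit when each key goes to bucket key % m."""
    return len({k % m for k in keys})

# Hypothetical word-aligned addresses: every key is a multiple of 4.
addresses = [0x1000 + 4 * i for i in range(1000)]

used_with_16 = buckets_used(addresses, 16)  # m is a multiple of 4
used_with_17 = buckets_used(addresses, 17)  # m is prime
```

With m = 16 only m/4 = 4 buckets are ever used, whereas with m = 17 all 17 buckets receive keys.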

In general:

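Every key in K that shares a common factor with the number of buckets m will be hashed to a bucket that is a multiple of this factor.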

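Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this be achieved? By choosing m to be a number that has very few factors: a prime number.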

From the answer by Mario.

2020-06-22 07:42:10

Primes are used because you have a good chance of obtaining a unique value for a typical hash function that uses polynomials modulo P. Say you use such a hash function for strings of length <= N, and you have a collision. That means that two different polynomials produce the same value modulo P. The difference of those polynomials is again a polynomial of degree N (or less). It has no more than N roots (this is where the nature of math shows itself, since this claim is only true for a polynomial over a field => prime number). So if N is much less than P, you are likely not to have a collision. After that, experiment can probably show that 37 is big enough to avoid collisions for a hash table of strings of length 5-10, and is small enough to use for calculations.
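The root-count claim can be illustrated with a small brute-force check. This is a sketch with a hypothetical helper `count_roots`, and the polynomial x^2 - 1 is chosen only as an example: modulo a prime it has at most 2 roots, but modulo a composite number it can have more.

```python
def count_roots(coeffs, modulus):
    """Count x in [0, modulus) where the polynomial is 0 modulo `modulus`.

    coeffs lists the coefficients from the highest degree down to the constant term.
    """
    def evaluate(x):
        acc = 0
        for c in coeffs:  # Horner's rule, reduced modulo `modulus` at each step
            acc = (acc * x + c) % modulus
        return acc

    return sum(1 for x in range(modulus) if evaluate(x) == 0)

x_squared_minus_one = [1, 0, -1]

roots_mod_7 = count_roots(x_squared_minus_one, 7)  # 7 is prime: integers mod 7 form a field
roots_mod_8 = count_roots(x_squared_minus_one, 8)  # 8 is composite: not a field
```

Modulo 7 the degree-2 polynomial has 2 roots (1 and 6); modulo 8 it has 4 (1, 3, 5 and 7), so the "no more than N roots" bound no longer holds.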

2013-11-26 01:04:11

tl;dr

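index[hash(input)%2] would result in a collision for half of all possible hashes and a range of values. index[hash(input)%prime] results in a collision of <2 of all possible hashes. Fixing the divisor to the table size also ensures that the number cannot be greater than the table.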

2012-11-06 01:31:06

Just writing down some thoughts gathered from the answers.

Hashing uses a modulus so that any value fits into a given range.
We want to randomize collisions. Randomizing collisions means there is no pattern to how collisions happen; in other words, changing a small part of the input results in a completely different hash value.
To randomize collisions, avoid using the base (10 in decimal, 16 in hex) as the modulus, because 11 % 10 -> 1, 21 % 10 -> 1, 31 % 10 -> 1. This shows a clear pattern in the hash value distribution: values with the same last digit collide.
Avoid using powers of the base (10^2, 10^3, 10^n) as the modulus, because this also creates a pattern: values with the same last n digits collide.
Actually, avoid using anything that has factors other than itself and 1, because it creates a pattern: multiples of a factor are hashed into selected values.
For example, 9 has 3 as a factor, thus 3, 6, 9, ... 999213 will always be hashed into 0, 3, 6.
12 has 3 and 2 as factors, thus 2n will always be hashed into 0, 2, 4, 6, 8, 10, and 3n will always be hashed into 0, 3, 6, 9.
This is a problem if the input is not evenly distributed: e.g. if many values are of the form 3n, then we only get 1/3 of all possible hash values and collisions are frequent.
So with a prime modulus, the only pattern is that multiples of the modulus always hash into 0; otherwise the hash values are evenly spread.
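The patterns listed above for 9 and 12 can be reproduced directly. The helper name `reachable_buckets` below is hypothetical, and 11 is used as an arbitrary prime for comparison.

```python
def reachable_buckets(step, modulus, count=1000):
    """Sorted list of buckets hit by step, 2*step, 3*step, ... taken modulo `modulus`."""
    return sorted({(step * i) % modulus for i in range(1, count + 1)})

multiples_of_3_mod_9 = reachable_buckets(3, 9)
multiples_of_3_mod_12 = reachable_buckets(3, 12)
multiples_of_2_mod_12 = reachable_buckets(2, 12)
multiples_of_3_mod_11 = reachable_buckets(3, 11)  # prime modulus
```

The first three results are [0, 3, 6], [0, 3, 6, 9] and [0, 2, 4, 6, 8, 10], matching the text, while with the prime modulus 11 the multiples of 3 reach every bucket from 0 to 10.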

2021-12-29 07:56:25

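The first thing you do when inserting into or retrieving from a hash table is to calculate the hashCode for the given key, and then find the correct bucket by trimming the hashCode to the size of the hash table with hashCode % table_length. Here are two "statements" that you have most probably read somewhere: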

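If you use a power of 2 for table_length, finding (hashCode(key) % 2^n) is as simple and quick as (hashCode(key) & (2^n - 1)). But if your function for calculating the hashCode of a given key isn't good, you will definitely suffer from clustering of many keys in a few hash buckets.
But if you use a prime number for table_length, the calculated hashCodes can map into different hash buckets even if you have a slightly stupid hashCode function.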
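Both statements can be sketched in Python, assuming non-negative hash codes (languages such as Java allow negative hashCode values, where % and & behave differently). The "poor" hash codes below are made up for illustration: every one is a multiple of 16.

```python
n = 4
table_length = 1 << n    # 16, a power of 2
mask = table_length - 1  # 0b1111

# Statement 1: for non-negative hash codes, % 2**n and & (2**n - 1) agree.
mask_matches_modulo = all(h % table_length == h & mask for h in range(10_000))

# A hypothetical poor hash function whose outputs are all multiples of 16.
poor_hash_codes = [16 * i for i in range(100)]

# With a power-of-2 table, every code lands in the same bucket.
buckets_pow2 = {h & mask for h in poor_hash_codes}
# Statement 2: with a prime table length (17) the same codes spread out.
buckets_prime = {h % 17 for h in poor_hash_codes}
```

Here `buckets_pow2` contains only bucket 0, while `buckets_prime` contains all 17 buckets.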

And here is the proof.

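Suppose your hashCode function produces, among others, the hashCodes {x, 2x, 3x, 4x, 5x, 6x, ...}. Then all of these will be clustered in just m buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify or derive this.) Now you can do one of the following to avoid clustering: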

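Make sure that you don't generate too many hashCodes that are multiples of another hashCode, as in {x, 2x, 3x, 4x, 5x, 6x, ...}. But this may be rather difficult if your hash table is supposed to hold millions of entries.
Or simply make m equal to table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e. by making table_length coprime with x. And if x can be just about any number, make sure that table_length is a prime number.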
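The formula m = table_length/GreatestCommonFactor(table_length, x) can be checked by brute force. In this sketch `occupied_buckets` and `predicted_buckets` are hypothetical helper names, and `math.gcd` plays the role of GreatestCommonFactor.

```python
from math import gcd

def occupied_buckets(x, table_length):
    """Count the distinct buckets hit by x, 2x, 3x, ... (one full cycle) modulo table_length."""
    return len({(x * i) % table_length for i in range(1, table_length + 1)})

def predicted_buckets(x, table_length):
    """The count given by the formula: table_length / gcd(table_length, x)."""
    return table_length // gcd(table_length, x)

# Compare the brute-force count with the formula for several table lengths.
formula_holds = all(
    occupied_buckets(x, t) == predicted_buckets(x, t)
    for t in (12, 16, 17, 31)
    for x in range(1, 51)
)

clustered = occupied_buckets(8, 16)  # gcd(16, 8) = 8
spread = occupied_buckets(8, 17)     # gcd(17, 8) = 1
```

With table_length 16 and x = 8 only 2 buckets are used; with the prime table_length 17 the same hashCodes use all 17. Note that a prime table_length still clusters when x is a multiple of that prime, since the greatest common factor is then table_length itself.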

From - http://srinvis.blogspot.com/2006/07/hash-table-lengths-and-prime-numbers.html

2009-09-23 06:58:18
