Why should hash functions use a prime number modulus?

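A long time ago, I bought a data structures book off the bargain table for $1.25. In it, the explanation of hash functions said that the hash should ultimately be taken mod a prime number because of "the nature of math".

What do you expect from a $1.25 book?

Anyway, I've had years to think about the nature of math, and I still can't figure it out.

Is the distribution of numbers truly more even when there is a prime number of buckets?

Or is this an old programmer's tale that everyone accepts because everybody else accepts it?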

Current answer

I would say the first answer at this link is the clearest answer I have found on this question.

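Consider the set of keys K = {0, 1, ..., 100} and a hash table where the number of buckets is m = 12. Since 3 is a factor of 12, the keys that are multiples of 3 will be hashed to buckets that are multiples of 3:

Keys {0, 12, 24, 36, ...} will be hashed to bucket 0.
Keys {3, 15, 27, 39, ...} will be hashed to bucket 3.
Keys {6, 18, 30, 42, ...} will be hashed to bucket 6.
Keys {9, 21, 33, 45, ...} will be hashed to bucket 9.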
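The bucket assignment above is easy to check directly. The following is a minimal sketch, assuming the hash is simply `key % m`; the helper name `bucket_counts` and the comparison against m = 13 are my own additions for illustration, not part of the quoted answer.

```python
from collections import Counter

def bucket_counts(keys, m):
    """Count how many keys land in each bucket under key % m."""
    return Counter(k % m for k in keys)

# The keys from K = {0, 1, ..., 100} that are multiples of 3.
keys = range(0, 101, 3)

used_12 = sorted(bucket_counts(keys, 12))  # buckets that receive at least one key
used_13 = sorted(bucket_counts(keys, 13))

print(used_12)       # [0, 3, 6, 9]
print(len(used_13))  # 13
```

With 12 buckets only four of them are ever used; with 13 buckets the same keys reach every bucket.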

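If K is uniformly distributed (that is, every key in K is equally likely to occur), then the choice of m is not so critical. But what happens if K is not uniformly distributed? Imagine that the keys most likely to occur are the multiples of 3. In that case, all of the buckets that are not multiples of 3 will be empty with high probability (which is really bad in terms of hash table performance).

This situation is more common than it may seem. Imagine, for instance, that you are keeping track of objects based on where they are stored in memory. If your computer's word size is four bytes, then you will be hashing keys that are multiples of 4. Needless to say, choosing m to be a multiple of 4 would be a terrible choice: you would have 3m/4 buckets completely empty, and all of your keys colliding in the remaining m/4 buckets.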

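In general:

Every key in K that shares a common factor with the number of buckets m will be hashed to a bucket that is a multiple of this factor.

Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this be achieved? By choosing m to be a number that has very few factors: a prime number.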
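One way to see the general rule is to count how many buckets are reachable when all keys are multiples of some step. The sketch below assumes `key % m` as the hash; the function name `reachable_buckets` is hypothetical. The count it reports matches m / gcd(step, m), which is a standard fact about modular arithmetic rather than something stated in the answer itself.

```python
from math import gcd

def reachable_buckets(step, m):
    """Buckets hit by the keys 0, step, 2*step, ... under key % m."""
    return {(i * step) % m for i in range(m)}

for m in (12, 13):
    for step in (2, 3, 4):
        hit = reachable_buckets(step, m)
        # Every reachable bucket index is a multiple of gcd(step, m),
        # and there are exactly m // gcd(step, m) of them.
        print(f"m={m} step={step} reachable={len(hit)} expected={m // gcd(step, m)}")
```

For m = 13 every step that is not a multiple of 13 reaches all 13 buckets, because the gcd is always 1.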

From the answer by Mario.

2020-06-22 07:42:10

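Other answers

tl;dr

index[hash(input)%2] would result in a collision for half of all possible hashes and a range of values. index[hash(input)%prime] results in a collision of <2 of all possible hashes. Fixing the divisor to the table size also ensures that the number cannot be greater than the table.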

2012-11-06 01:31:06

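Usually a simple hash function works by taking the "component parts" of the input (characters in the case of a string), multiplying them by the powers of some constant, and adding them together in some integer type. So for example a typical (although not especially good) hash of a string might be:

(first char) + k * (second char) + k^2 * (third char) + ...

Then if a bunch of strings all having the same first char are fed in, the results will all be the same modulo k, at least until the integer type overflows.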
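To make that concrete, here is a small sketch of the hash described above. The name `simple_hash`, the choice k = 31, and the sample words are all assumptions for illustration. Python integers do not overflow, so the "until the integer type overflows" caveat does not come into play here.

```python
def simple_hash(s, k=31):
    """(first char) + k * (second char) + k^2 * (third char) + ..."""
    return sum(ord(c) * k**i for i, c in enumerate(s))

words = ["apple", "avocado", "apricot", "almond"]  # same first character

# Every term except the first is a multiple of k, so only the first
# character survives when the hash is reduced modulo k.
residues = {simple_hash(w) % 31 for w in words}
print(residues)  # a single residue, equal to ord('a') % 31
```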

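[As an example, Java's string hashCode is eerily similar to this: it processes the characters in reverse order, with k=31. So you get striking relationships modulo 31 between strings that end the same way, and striking relationships modulo 2^32 between strings that are the same except near the end. This doesn't seriously mess up hashtable behaviour.]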

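A hashtable works by taking the hash modulo the number of buckets.

It's important in a hashtable not to produce collisions for likely cases, since collisions reduce the efficiency of the hashtable.

Now, suppose someone puts a whole bunch of values into a hashtable that have some relationship between the items, like all having the same first character. This is a fairly predictable usage pattern, I'd say, so we don't want it to produce too many collisions.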

It turns out that "because of the nature of maths", if the constant used in the hash, and the number of buckets, are coprime, then collisions are minimised in some common cases. If they are not coprime, then there are some fairly simple relationships between inputs for which collisions are not minimised. All the hashes come out equal modulo the common factor, which means they'll all fall into the 1/n th of the buckets which have that value modulo the common factor. You get n times as many collisions, where n is the common factor. Since n is at least 2, I'd say it's unacceptable for a fairly simple use case to generate at least twice as many collisions as normal. If some user is going to break our distribution into buckets, we want it to be a freak accident, not some simple predictable usage.
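A rough way to see the effect of a common factor: hash a family of strings that share a first character and compare how many buckets they occupy. This is only a sketch under assumed values. The constant k = 6, the bucket counts 12 and 13, and the helper names are hypothetical, chosen so that k shares a factor with 12 but not with 13.

```python
from itertools import product
from string import ascii_lowercase

def simple_hash(s, k):
    """(first char) + k * (second char) + k^2 * (third char) + ..."""
    return sum(ord(c) * k**i for i, c in enumerate(s))

def buckets_used(strings, k, m):
    """Set of bucket indices occupied when hashing with constant k into m buckets."""
    return {simple_hash(s, k) % m for s in strings}

# All 676 three-letter strings that start with 'a'.
strings = ["a" + "".join(t) for t in product(ascii_lowercase, repeat=2)]

shared = buckets_used(strings, k=6, m=12)   # gcd(6, 12) = 6
coprime = buckets_used(strings, k=6, m=13)  # gcd(6, 13) = 1
print(len(shared), len(coprime))  # 2 13
```

With the shared factor, 676 strings crowd into 2 of the 12 buckets; with coprime values they spread over all 13.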

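Now, obviously hashtable implementations have no control over the items put into them. They can't prevent them being related. So the thing to do is to ensure that the constant and the bucket count are coprime. That way you aren't relying on the "last" component alone to determine the bucket modulo some small common factor. As far as I know they don't have to be prime to achieve this, just coprime.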

But if the hash function and the hashtable are written independently, then the hashtable doesn't know how the hash function works. It might be using a constant with small factors. If you're lucky it might work completely differently and be nonlinear. If the hash is good enough, then any bucket count is just fine. But a paranoid hashtable can't assume a good hash function, so should use a prime number of buckets. Similarly a paranoid hash function should use a largeish prime constant, to reduce the chance that someone uses a number of buckets which happens to have a common factor with the constant.

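In practice, I think it's fairly normal to use a power of 2 as the number of buckets. This is convenient and saves having to search around or pre-select a prime number of the right magnitude. So you rely on the hash function not to use even multipliers, which is generally a safe assumption. But you can still get occasional bad hashing behaviour with hash functions like the one above, and a prime bucket count could help further.

Putting about the principle that "everything has to be prime" is, as far as I know, a sufficient but not a necessary condition for good distribution over hashtables. It allows everybody to interoperate without needing to assume that the others have followed the same rule.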

[Edit: there's another, more specialized reason to use a prime number of buckets, which is if you handle collisions with linear probing. Then you calculate a stride from the hashcode, and if that stride comes out to be a factor of the bucket count then you can only do (bucket_count / stride) probes before you're back where you started. The case you most want to avoid is stride = 0, of course, which must be special-cased, but to avoid also special-casing bucket_count / stride equal to a small integer, you can just make the bucket_count prime and not care what the stride is provided it isn't 0.]
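The probing point can be illustrated with a short simulation. This is a sketch, assuming an open-addressing table where the next slot is `(slot + stride) % bucket_count`; the function name `probe_sequence` and the sample numbers are hypothetical.

```python
def probe_sequence(start, stride, bucket_count):
    """Distinct slots visited by start, start+stride, ... before the walk repeats."""
    seen = []
    slot = start % bucket_count
    while slot not in seen:
        seen.append(slot)
        slot = (slot + stride) % bucket_count
    return seen

# Stride 4 divides 16, so only 16 / 4 = 4 slots are ever probed.
print(len(probe_sequence(5, 4, 16)))  # 4

# With a prime bucket count, any nonzero stride below it visits every slot.
print(len(probe_sequence(5, 4, 17)))  # 17
```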

2009-07-18 10:43:06

http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

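Pretty clear explanation, with pictures too.

Edit: As a summary, primes are used because you have the best chance of obtaining a unique value when multiplying values by the chosen prime number and adding them all up. For example, given a string, multiplying each letter value by the prime number and then adding those all up will give you its hash value.

A better question would be: why exactly the number 31?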

2009-07-17 19:33:27

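I'd like to add something to Steve Jessop's answer (I can't comment on it since I don't have enough reputation), as I found some helpful material. His answer is very helpful, but he made a mistake: the bucket count should not be a power of 2. I'll just quote from the book "Introduction to Algorithms" by Thomas Cormen, Charles Leiserson, et al., page 263: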

When using the division method, we usually avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing the hash function to depend on all the bits of the key. As Exercise 11.3-3 asks you to show, choosing m = 2^p-1 when k is a character string interpreted in radix 2^p may be a poor choice, because permuting the characters of k does not change its hash value.
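Both claims in the quoted passage can be checked in a few lines. This is a sketch with assumed sample values: the names `division_hash` and `radix_value`, the keys, and the choice p = 8 for the string case are mine, not from the book.

```python
def division_hash(k, m):
    """The division method: h(k) = k mod m."""
    return k % m

# Case 1: m = 2^p keeps only the p lowest-order bits of the key.
p = 4
m = 2 ** p
keys = [0x10, 0x20, 0x30, 0xA0]  # differ only in their high-order bits
print([division_hash(k, m) for k in keys])  # [0, 0, 0, 0]

def radix_value(s, p=8):
    """Interpret an ASCII string as a number in radix 2^p."""
    value = 0
    for byte in s.encode("ascii"):
        value = value * 2**p + byte
    return value

# Case 2: m = 2^p - 1. Since 2^p is congruent to 1 mod m, the hash reduces
# to the sum of the characters, so permuting them cannot change it.
m2 = 2 ** 8 - 1
print(division_hash(radix_value("stop"), m2) == division_hash(radix_value("pots"), m2))  # True
```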

Hope it helps.

2015-12-03 17:43:02

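For a hash function it's not only important to minimize collisions in general, but also to make it impossible to keep the same hash while changing a few bytes.

Say you have an equation: (x + y*z) % key = x with 0<x<key and 0<z<key. If key is a prime number, n*y = key is true for every n in N and false for every other number.

An example where key is not a prime: x=1, z=2, key=8. Because key/z=4 is still a natural number, 4 becomes a solution of our equation, and in this case (n/2)*y = key is true for every n in N. The number of solutions of the equation has practically doubled because 8 isn't a prime.

If our attacker already knows that 8 is a possible solution of the equation, he can change the file from producing 8 to producing 4 and still get the same hash.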

2009-07-18 14:01:27
