根据Java文档,String对象的哈希代码是这样计算的:
S [0]*31^(n-1) + S [1]*31^(n-2) +…+ s (n - 1) 使用int算术,其中s[i]是 字符串的第i个字符,n是的长度 字符串,^表示取幂。
为什么用31作为乘数?
我知道乘数应该是一个相对较大的质数。那么为什么不是29岁,37岁,甚至97岁呢?
根据Java文档,String对象的哈希代码是这样计算的:
S [0]*31^(n-1) + S [1]*31^(n-2) +…+ s (n - 1) 使用int算术,其中s[i]是 字符串的第i个字符,n是的长度 字符串,^表示取幂。
为什么用31作为乘数?
我知道乘数应该是一个相对较大的质数。那么为什么不是29岁,37岁,甚至97岁呢?
当前回答
You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions in regards to the resulting "average chain size" in a hash table. P(31) was one of the common functions during that time which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to choose one and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:
剩下的 第四,我可能会选择P(31),因为它是RISC上最便宜的计算方法 机器(因为31是2的2次幂之差)P (33) 同样,计算起来很便宜,但它的性能稍微差一些,并且 33是合数,这让我有点紧张。
因此,推理并不像这里的许多答案所暗示的那样理性。但我们都善于在做出直觉决定后提出理性的理由(甚至布洛赫也可能倾向于这样)。
其他回答
对哈希函数的一个很大的期望是,它们的结果的均匀随机性在哈希(x) % N这样的操作中仍然存在,其中N是一个任意数字(在许多情况下,是2的幂),一个原因是这样的操作通常用于哈希表中确定槽。在计算散列时使用质数乘数会降低乘数和N个共享除数的概率,这将使操作的结果不那么均匀随机。
还有人指出了一个很好的性质,即乘31可以由一个乘法和一个减法来完成。我只是想指出这样的质数有一个数学术语:梅森质数
所有梅森质数都比2的幂小1所以我们可以写成
p = 2^n - 1
用x乘以p:
x * p = x * (2^n - 1) = x * 2^n - x = (x << n) - x
在许多机器上,移位(SAL/SHL)和减法(SUB)通常比乘法(MUL)快。参见Agner Fog的指令表
这就是为什么GCC似乎通过将梅森质数的乘法替换为移位和置换来优化梅森质数的乘法,见这里。
然而,在我看来,如此小的质数对于哈希函数来说是一个糟糕的选择。对于一个相对较好的哈希函数,你会期望在哈希的较高位具有随机性。然而,使用Java哈希函数,对于较短的字符串,在较高的位上几乎没有随机性(在较低的位上仍然具有高度可疑的随机性)。这使得构建高效哈希表变得更加困难。看看这个你用Java哈希函数做不到的小技巧。
一些回答提到,他们认为31适合一个字节是好的。这实际上是无用的,因为:
(1)我们执行移位而不是乘法,因此乘数的大小无关紧要。
(2)据我所知,没有特定的x86指令来将一个8字节的值与一个1字节的值相乘,所以即使是相乘,你也需要将“31”转换为8字节的值。看这里,你将整个64位寄存器相乘。
(127实际上是一个字节所能容纳的最大梅森素数。)
较小的值是否会增加中下位的随机性?也许吧,但这似乎也大大增加了可能的碰撞:)。
人们可以列出许多不同的问题,但它们通常可以归结为两个未能很好地实现的核心原则:混乱和扩散
但是速度快吗?可能吧,因为它没什么用。然而,如果性能真的是这里的重点,那么每个循环一个字符是相当低效的。对于更长的字符串,为什么不每次循环迭代4个字符(8字节),就像这样?好吧,这将很难与当前定义的哈希,你需要单独乘以每个字符(请告诉我如果有一点hack来解决这个问题:D)。
事实上,37就可以了!Z:= 37 * x可计算为y:= x + 8 * x;z:= x + 4 * y。这两个步骤都对应一个LEA x86指令,所以这是非常快的。
事实上,通过设置y:= x + 8 * x,可以以同样的速度完成与更大的素数73的乘法运算;Z:= x + 8 * y。
使用73或37(而不是31)可能会更好,因为它会导致更密集的代码:两条LEA指令只占用6个字节,而move+shift+subtract(31的乘法)占用7个字节。一个可能的警告是,这里使用的3参数LEA指令在英特尔的Sandy桥架构上变慢了,延迟增加了3个周期。
而且,73是谢尔顿·库珀最喜欢的数字。
在JDK-4045622中,Joshua Bloch描述了为什么选择特定的(新)String.hashCode()实现的原因
The table below summarizes the performance of the various hash functions described above, for three data sets: 1) All of the words and phrases with entries in Merriam-Webster's 2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars). 2) All of the strings in /bin/, /usr/bin/, /usr/lib/, /usr/ucb/ and /usr/openwin/bin/* (66,304 strings, avg length 21 characters). 3) A list of URLs gathered by a web-crawler that ran for several hours last night (28,372 strings, avg length 49 characters). The performance metric shown in the table is the "average chain size" over all elements in the hash table (i.e., the expected value of the number of key compares to look up an element). Webster's Code Strings URLs --------- ------------ ---- Current Java Fn. 1.2509 1.2738 13.2560 P(37) [Java] 1.2508 1.2481 1.2454 P(65599) [Aho et al] 1.2490 1.2510 1.2450 P(31) [K+R] 1.2500 1.2488 1.2425 P(33) [Torek] 1.2500 1.2500 1.2453 Vo's Fn 1.2487 1.2471 1.2462 WAIS Fn 1.2497 1.2519 1.2452 Weinberger's Fn(MatPak) 6.5169 7.2142 30.6864 Weinberger's Fn(24) 1.3222 1.2791 1.9732 Weinberger's Fn(28) 1.2530 1.2506 1.2439 Looking at this table, it's clear that all of the functions except for the current Java function and the two broken versions of Weinberger's function offer excellent, nearly indistinguishable performance. I strongly conjecture that this performance is essentially the "theoretical ideal", which is what you'd get if you used a true random number generator in place of a hash function. I'd rule out the WAIS function as its specification contains pages of random numbers, and its performance is no better than any of the far simpler functions. Any of the remaining six functions seem like excellent choices, but we have to pick one. I suppose I'd rule out Vo's variant and Weinberger's function because of their added complexity, albeit minor. Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous. Josh
尼尔·科菲解释了为什么在熨平偏差下使用31。
基本上,使用31可以为哈希函数提供更均匀的集位概率分布。
You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions in regards to the resulting "average chain size" in a hash table. P(31) was one of the common functions during that time which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to choose one and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:
剩下的 第四,我可能会选择P(31),因为它是RISC上最便宜的计算方法 机器(因为31是2的2次幂之差)P (33) 同样,计算起来很便宜,但它的性能稍微差一些,并且 33是合数,这让我有点紧张。
因此,推理并不像这里的许多答案所暗示的那样理性。但我们都善于在做出直觉决定后提出理性的理由(甚至布洛赫也可能倾向于这样)。