Per the Java documentation, the hash code for a String object is computed as:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the i-th character of the string, n is the length of the string, and ^ indicates exponentiation.
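For reference, the formula can be evaluated with Horner's rule in a few lines. This is only a sketch to make the formula concrete: the names PolyHashSketch and polyHash are made up for illustration and it is not the JDK source, although the result should agree with String.hashCode() because the documented formula is the same.

```java
// Illustrative sketch; PolyHashSketch and polyHash are hypothetical names.
final class PolyHashSketch {

    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] using Horner's rule.
    // int arithmetic silently wraps on overflow, which is what the formula intends.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String sample = "hello";
        // Compare the hand-rolled value with the built-in one.
        System.out.println(polyHash(sample) == sample.hashCode());
    }
}
```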

Why is 31 used as a multiplier?

I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?


Current answer

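I'm not sure, but I would guess that they tested some sample of prime numbers and found that 31 gave the best distribution over some sample of possible strings.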

Other answers

On (mostly) old processors, multiplying by 31 can be relatively cheap. On an ARM, for instance, it is only one instruction:

RSB       r1, r0, r0, ASL #5    ; r1 := - r0 + (r0<<5)

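Most other processors would require a separate shift and subtract instruction. However, if your multiplier is slow this is still a win. Modern processors tend to have fast multipliers, so it doesn't make much difference, so long as 32 goes on the correct side.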

It's not a great hash algorithm, but it's good enough, and better than the 1.0 code (and very much better than the 1.0 spec!).

By multiplying, bits are shifted to the left. This uses more of the available space of hash codes, reducing collisions.

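By not using a power of two, the lower-order, rightmost bits are populated as well, to be mixed with the next piece of data going into the hash.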

The expression n * 31 is equivalent to (n << 5) - n.
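The equivalence also holds when the multiplication overflows, since both sides wrap around modulo 2^32 in Java's int arithmetic. A small check, where ShiftSubtractSketch and timesThirtyOne are names invented for this sketch:

```java
// Illustrative sketch; ShiftSubtractSketch and timesThirtyOne are hypothetical names.
final class ShiftSubtractSketch {

    // Multiplies by 31 using one shift and one subtraction: 32*n - n.
    static int timesThirtyOne(int n) {
        return (n << 5) - n;
    }

    public static void main(String[] args) {
        // Includes values whose product with 31 overflows the int range.
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int n : samples) {
            System.out.println(n + ": " + (timesThirtyOne(n) == 31 * n));
        }
    }
}
```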

According to Joshua Bloch's Effective Java (a book that can't be recommended enough, and which I bought thanks to continual mentions on stackoverflow):

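The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.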

(from Chapter 3, Item 9: Always override hashcode when you override equals, page 48)
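The information loss with an even multiplier is easiest to see in the extreme case of a power of two. With a multiplier of 32, the first character of an 8-character string is scaled by 32^7 = 2^35, which is 0 modulo 2^32, so it no longer affects the result at all. The sketch below shows this; PowerOfTwoSketch and hashWith are hypothetical names, and the multiplier 32 is chosen only as a counterexample.

```java
// Illustrative sketch; PowerOfTwoSketch and hashWith are hypothetical names.
final class PowerOfTwoSketch {

    // Same polynomial scheme as String.hashCode(), but with a configurable multiplier.
    static int hashWith(int multiplier, String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String a = "Axxxxxxx";
        String b = "Bxxxxxxx";
        // With 32, the leading character is shifted out of the 32-bit result entirely.
        System.out.println("32 collides: " + (hashWith(32, a) == hashWith(32, b)));
        // With 31 (odd), the leading character still contributes.
        System.out.println("31 collides: " + (hashWith(31, a) == hashWith(31, b)));
    }
}
```

Any odd multiplier avoids this particular failure, because an odd number raised to any power is still odd and therefore never becomes 0 modulo 2^32.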

In JDK-4045622, Joshua Bloch describes why that particular (new) String.hashCode() implementation was chosen:

The table below summarizes the performance of the various hash functions described above, for three data sets:

1) All of the words and phrases with entries in Merriam-Webster's 2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars).

2) All of the strings in /bin/, /usr/bin/, /usr/lib/, /usr/ucb/ and /usr/openwin/bin/* (66,304 strings, avg length 21 characters).

3) A list of URLs gathered by a web-crawler that ran for several hours last night (28,372 strings, avg length 49 characters).

The performance metric shown in the table is the "average chain size" over all elements in the hash table (i.e., the expected value of the number of key compares to look up an element).

                          Webster's   Code Strings   URLs
                          ---------   ------------   ----
Current Java Fn.          1.2509      1.2738         13.2560
P(37)    [Java]           1.2508      1.2481         1.2454
P(65599) [Aho et al]      1.2490      1.2510         1.2450
P(31)    [K+R]            1.2500      1.2488         1.2425
P(33)    [Torek]          1.2500      1.2500         1.2453
Vo's Fn                   1.2487      1.2471         1.2462
WAIS Fn                   1.2497      1.2519         1.2452
Weinberger's Fn(MatPak)   6.5169      7.2142         30.6864
Weinberger's Fn(24)       1.3222      1.2791         1.9732
Weinberger's Fn(28)       1.2530      1.2506         1.2439

Looking at this table, it's clear that all of the functions except for the current Java function and the two broken versions of Weinberger's function offer excellent, nearly indistinguishable performance. I strongly conjecture that this performance is essentially the "theoretical ideal", which is what you'd get if you used a true random number generator in place of a hash function.

I'd rule out the WAIS function as its specification contains pages of random numbers, and its performance is no better than any of the far simpler functions. Any of the remaining six functions seem like excellent choices, but we have to pick one. I suppose I'd rule out Vo's variant and Weinberger's function because of their added complexity, albeit minor. Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous.

Josh
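To get a feel for the "average chain size" metric, here is one possible way to compute it for a polynomial hash P(multiplier). This is a sketch under assumptions of my own, not the benchmark from the bug report: it assumes separate chaining, assumes that finding the k-th key in a chain costs k compares, and takes the bucket count as a parameter because the report does not state the table size. ChainSizeSketch, hashWith and averageCompares are hypothetical names.

```java
// Illustrative sketch; all names here are hypothetical and the metric is one
// reading of "expected number of key compares to look up an element".
final class ChainSizeSketch {

    // Polynomial hash with a configurable multiplier, as in P(31), P(33), P(37).
    static int hashWith(int multiplier, String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    // Average compares per successful lookup over all keys (keys must be non-empty).
    static double averageCompares(String[] keys, int multiplier, int buckets) {
        int[] chainLength = new int[buckets];
        long totalCompares = 0;
        for (String key : keys) {
            // floorMod keeps the bucket index non-negative for negative hashes.
            int bucket = Math.floorMod(hashWith(multiplier, key), buckets);
            chainLength[bucket]++;
            // The key lands at this position in its chain, so it costs that many compares.
            totalCompares += chainLength[bucket];
        }
        return (double) totalCompares / keys.length;
    }

    public static void main(String[] args) {
        String[] keys = {"/bin/ls", "/bin/cat", "/usr/bin/env", "/usr/lib/libc.so"};
        for (int multiplier : new int[] {31, 33, 37}) {
            System.out.println("P(" + multiplier + "): "
                    + averageCompares(keys, multiplier, 8));
        }
    }
}
```

A value of 1.0 means every key sits alone in its bucket; the further above 1.0, the longer the chains a lookup has to walk.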
