为什么Java的hashCode()在字符串使用31作为乘数?

根据Java文档，String对象的哈希代码是这样计算的:

S [0]*31^(n-1) + S [1]*31^(n-2) +…+ s (n - 1) 使用int算术，其中s[i]是字符串的第i个字符，n是的长度字符串，^表示取幂。

为什么用31作为乘数?

我知道乘数应该是一个相对较大的质数。那么为什么不是29岁，37岁，甚至97岁呢?

当前回答

在JDK-4045622中，Joshua Bloch描述了为什么选择特定的(新)String.hashCode()实现的原因

The table below summarizes the performance of the various hash functions described above, for three data sets: 1) All of the words and phrases with entries in Merriam-Webster's 2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars). 2) All of the strings in /bin/, /usr/bin/, /usr/lib/, /usr/ucb/ and /usr/openwin/bin/* (66,304 strings, avg length 21 characters). 3) A list of URLs gathered by a web-crawler that ran for several hours last night (28,372 strings, avg length 49 characters). The performance metric shown in the table is the "average chain size" over all elements in the hash table (i.e., the expected value of the number of key compares to look up an element). Webster's Code Strings URLs --------- ------------ ---- Current Java Fn. 1.2509 1.2738 13.2560 P(37) [Java] 1.2508 1.2481 1.2454 P(65599) [Aho et al] 1.2490 1.2510 1.2450 P(31) [K+R] 1.2500 1.2488 1.2425 P(33) [Torek] 1.2500 1.2500 1.2453 Vo's Fn 1.2487 1.2471 1.2462 WAIS Fn 1.2497 1.2519 1.2452 Weinberger's Fn(MatPak) 6.5169 7.2142 30.6864 Weinberger's Fn(24) 1.3222 1.2791 1.9732 Weinberger's Fn(28) 1.2530 1.2506 1.2439 Looking at this table, it's clear that all of the functions except for the current Java function and the two broken versions of Weinberger's function offer excellent, nearly indistinguishable performance. I strongly conjecture that this performance is essentially the "theoretical ideal", which is what you'd get if you used a true random number generator in place of a hash function. I'd rule out the WAIS function as its specification contains pages of random numbers, and its performance is no better than any of the far simpler functions. Any of the remaining six functions seem like excellent choices, but we have to pick one. I suppose I'd rule out Vo's variant and Weinberger's function because of their added complexity, albeit minor. Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous. Josh

2017-06-12 21:17:03

其他回答

You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions in regards to the resulting "average chain size" in a hash table. P(31) was one of the common functions during that time which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to choose one and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:

剩下的第四，我可能会选择P(31)，因为它是RISC上最便宜的计算方法机器(因为31是2的2次幂之差)P (33) 同样，计算起来很便宜，但它的性能稍微差一些，并且 33是合数，这让我有点紧张。

因此，推理并不像这里的许多答案所暗示的那样理性。但我们都善于在做出直觉决定后提出理性的理由(甚至布洛赫也可能倾向于这样)。

2016-02-10 00:46:26

尼尔·科菲解释了为什么在熨平偏差下使用31。

基本上，使用31可以为哈希函数提供更均匀的集位概率分布。

2011-12-07 15:27:18

我不确定，但我猜他们测试了一些质数样本，发现31在一些可能的字符串样本中给出了最好的分布。

2008-11-18 16:58:03

Java字符串hashCode()和31

这是因为31有一个很好的属性——它的乘法运算可以被逐位移位取代，这比标准乘法运算快得多: