How does a hash table work?

The Basic Idea

Why do people use dressers to store their clothing? Besides looking trendy and stylish, they have the advantage that every article of clothing has a place where it's supposed to be. If you're looking for a pair of socks, you just check the sock drawer. If you're looking for a shirt, you check the drawer that has your shirts in it. It doesn't matter, when you're looking for socks, how many shirts you have or how many pairs of pants you own, since you don't need to look at them. You just look in the sock drawer and expect to find socks there.

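At a high level, a hash table is a way of storing things that is (kinda sorta ish) like a dresser for clothing. The basic idea is the following:

1. You get some number of locations (drawers) where items can be stored.
2. You come up with some rule that tells you which location (drawer) each item belongs in.
3. When you need to find something, you use that rule to determine which drawer to look in.

The advantage of a system like this is that, assuming your rule isn't too complicated and you have an appropriate number of drawers, you can find what you're looking for pretty quickly by simply looking in the right place.

When you're putting your clothes away, the "rule" you use might be something like "socks go in the top-left drawer, shirts go in the large middle drawer, and so on." When you're storing more abstract data, though, we use something called a hash function to do this for us.

A reasonable way to think about a hash function is as a black box. You put data in one side, and a number called the hash code comes out the other. Schematically, it looks something like this: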

              +---------+
            |\|   hash  |/| --> hash code
   data --> |/| function|\|
              +---------+

All hash functions are deterministic: if you put the same data into the function multiple times, you'll always get the same value coming out the other side. And a good hash function should look more or less random: small changes to the input data should give wildly different hash codes. For example, the hash codes for the string "pudu" and for the string "kudu" will likely be wildly different from one another. (Then again, it's possible that they're the same. After all, if a hash function's outputs should look more or less random, there's a chance we get the same hash code twice.)

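How do you build a hash function? For now, let's go with "decent people shouldn't think too much about that." Mathematicians have worked out better and worse ways to design hash functions, but for our purposes we don't really need to worry too much about the internals. It's plenty good to just think of a hash function as a function that's deterministic (equal inputs give equal outputs) but looks random (it's hard to predict one hash code given another).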
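
If you do want to peek inside the black box, here is a minimal sketch of one simple, well-known hash function, 32-bit FNV-1a. The choice of FNV-1a is purely illustrative (nothing above depends on it), and real libraries typically use something more sophisticated.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a, a simple non-cryptographic hash function."""
    h = 0x811C9DC5                         # FNV offset basis
    for byte in data:
        h ^= byte                          # mix the next byte into the state
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h

# Deterministic: hashing the same data twice gives the same hash code.
print(fnv1a_32(b"pudu") == fnv1a_32(b"pudu"))

# One changed letter gives a different hash code for this pair of strings.
print(hex(fnv1a_32(b"pudu")), hex(fnv1a_32(b"kudu")))
```

The function has no hidden state, which is exactly what makes it deterministic: the output depends only on the input bytes.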

Once we have a hash function, we can build a very simple hash table. We'll make an array of "buckets," which you can think of as being analogous to drawers in our dresser. To store an item in the hash table, we'll compute the hash code of the object and use it as an index in the table, which is analogous to "pick which drawer this item goes in." Then, we put that data item inside the bucket at that index. If that bucket was empty, great! We can put the item there. If that bucket is full, we have some choices of what we can do. A simple approach (called chained hashing) is to treat each bucket as a list of items, the same way that your sock drawer might store multiple socks, and then just add the item to the list at that index.

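To look something up in a hash table, we use basically the same procedure. We begin by computing the hash code for the item to look up, which tells us which bucket (drawer) to look in. If the item is in the table, it has to be in that bucket. Then, we just look at all the items in the bucket and see if our item is one of them.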
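
The insert and lookup procedures just described can be sketched in a few lines. This is a minimal illustration, not a production design: the class name `ChainedHashSet` is made up for this example, Python's built-in `hash()` stands in for the hash function, and the number of buckets is fixed.

```python
class ChainedHashSet:
    """A bare-bones chained hash table that stores a set of items."""

    def __init__(self, num_buckets=8):
        # Each bucket is a list, like a drawer that can hold several socks.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, item):
        # Hash code -> bucket index ("which drawer does this go in?").
        return self.buckets[hash(item) % len(self.buckets)]

    def add(self, item):
        bucket = self._bucket_for(item)
        if item not in bucket:      # avoid storing duplicates
            bucket.append(item)

    def __contains__(self, item):
        # If the item is in the table at all, it must be in this one bucket.
        return item in self._bucket_for(item)


table = ChainedHashSet()
for word in ["pudu", "kudu", "okapi"]:
    table.add(word)
print("pudu" in table, "zebu" in table)
```

Note that a lookup never touches any bucket other than the one the hash code points to, no matter how many items live elsewhere in the table.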

What's the advantage of doing things this way? Well, assuming we have a large number of buckets, we'd expect that most buckets won't have too many things in them. After all, our hash function kinda sorta ish looks like it has random outputs, so the items are distributed kinda sorta ish evenly across all the buckets. In fact, if we formalize the notion of "our hash function looks kinda random," we can prove that the expected number of items in each bucket is the ratio of the total number of items to the total number of buckets. Therefore, we can find the items we're looking for without having to do too much work.
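
One way to make that last claim precise, under the idealized assumption that each of the n items lands in each of the m buckets with probability 1/m (the symbols n, m, and X_b are introduced here just for this calculation): let X_b be the number of items in bucket b. By linearity of expectation,

```latex
% Assumption: for every item x_i and bucket b, Pr[h(x_i) = b] = 1/m.
X_b = \sum_{i=1}^{n} \mathbf{1}[h(x_i) = b]
\quad\Longrightarrow\quad
\mathbb{E}[X_b] = \sum_{i=1}^{n} \Pr[h(x_i) = b]
               = \sum_{i=1}^{n} \frac{1}{m}
               = \frac{n}{m}
```

So with, say, 1,000 items and 2,000 buckets, the expected number of items in any given bucket is 1/2.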

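The Details

Explaining how "a hash table" works is a bit tricky because there are many flavors of hash tables. This next section talks about a few general implementation details common to all hash tables, plus some specifics of how different styles of hash tables work.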

A first question that comes up is how you turn a hash code into a table slot index. In the above discussion, I just said "use the hash code as an index," but that's actually not a very good idea. In most programming languages, hash codes work out to 32-bit or 64-bit integers, and you aren't going to be able to use those directly as bucket indices. Instead, a common strategy is to make an array of buckets of some size m, compute the (full 32- or 64-bit) hash codes for your items, then mod them by the size of the table to get an index between 0 and m-1, inclusive. The use of modulus works well here because it's decently fast and does a decent job spreading the full range of hash codes across a smaller range.

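(Bitwise operators are sometimes used here. If your table has a size that's a power of two, say 2^k, then computing the bitwise AND of the hash code and the number 2^k - 1 is equivalent to computing a modulus, and it's significantly faster.)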
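
Both ways of turning a hash code into a slot index fit in a couple of lines. The function names below are made up for this sketch, and the speed difference between the two matters in lower-level languages far more than it does in Python.

```python
def index_by_mod(hash_code: int, m: int) -> int:
    """Map a full-width hash code to a slot index in 0 .. m-1."""
    return hash_code % m

def index_by_mask(hash_code: int, k: int) -> int:
    """Same result as index_by_mod(hash_code, 2**k), using a bitwise AND."""
    return hash_code & ((1 << k) - 1)   # (1 << k) - 1 is 2^k - 1: k one-bits

# The two agree whenever the table size is a power of two.
h = 0xDEADBEEF
print(index_by_mod(h, 16), index_by_mask(h, 4))
```

The mask trick only works for power-of-two table sizes; for any other size m, the modulus is the one to use.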

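The next question is how to pick the right number of buckets. If you pick too many buckets, then most buckets will be empty or have few elements (good for speed, since you only have to check a few items per bucket), but you'll use a bunch of space simply storing the buckets (not so great, though maybe you can afford it). The flip side holds as well: if you have too few buckets, then you'll have more elements per bucket on average, making lookups take longer, but you'll use less memory.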

A good compromise is to dynamically change the number of buckets over the lifetime of the hash table. The load factor of a hash table, typically denoted α, is the ratio of the number of elements to the number of buckets. Most hash tables pick some maximum load factor. Once the load factor crosses this limit, the hash table increases its number of slots (say, by doubling), then redistributes the elements from the old table into the new one. This is called rehashing. Assuming the maximum load factor in the table is a constant, this ensures that, assuming you have a good hash function, the expected cost of doing a lookup remains O(1). Insertions now have an amortized expected cost of O(1) because of the cost of periodically rebuilding the table, as is the case with deletions. (Deletions can similarly compact the table if the load factor gets too small.)
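
Here is a minimal sketch of rehashing on top of a chained table. The maximum load factor of 0.75, the initial size of 4 buckets, and the names `ResizingHashSet` and `_rehash` are all assumptions made for this example; shrinking on deletion is left out.

```python
class ResizingHashSet:
    MAX_LOAD_FACTOR = 0.75   # assumed threshold; real tables pick their own

    def __init__(self):
        self.buckets = [[] for _ in range(4)]
        self.size = 0        # number of stored items

    def load_factor(self):
        # alpha = (number of elements) / (number of buckets)
        return self.size / len(self.buckets)

    def add(self, item):
        bucket = self.buckets[hash(item) % len(self.buckets)]
        if item in bucket:
            return
        bucket.append(item)
        self.size += 1
        if self.load_factor() > self.MAX_LOAD_FACTOR:
            self._rehash(2 * len(self.buckets))   # grow by doubling

    def _rehash(self, new_count):
        # Every item may land in a different bucket once the table size
        # changes, so each one is redistributed into the new array.
        old_buckets = self.buckets
        self.buckets = [[] for _ in range(new_count)]
        for bucket in old_buckets:
            for item in bucket:
                self.buckets[hash(item) % new_count].append(item)

    def __contains__(self, item):
        return item in self.buckets[hash(item) % len(self.buckets)]
```

Each rehash touches every item, which is expensive, but because the table doubles each time those rebuilds become rarer as the table grows. That is where the "amortized" in the O(1) insertion cost comes from.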

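Hashing Strategies

So far, we've been talking about chained hashing, which is one of many different strategies for building a hash table. As a reminder, chained hashing kinda sorta looks like a clothing dresser: each bucket (drawer) can hold multiple items, and when you do a lookup you check all of those items.

However, this isn't the only way to build a hash table. There's another family of hash tables that use a strategy called open addressing. The basic idea behind open addressing is to store an array of slots, where each slot can either be empty or hold exactly one item.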

In open addressing, when you perform an insertion, as before, you jump to some slot whose index depends on the hash code computed. If that slot is free, great! You put the item there, and you're done. But what if the slot is already full? In that case, you use some secondary strategy to find a different free slot in which to store the item. The most common strategy for doing this uses an approach called linear probing. In linear probing, if the slot you want is already full, you simply shift to the next slot in the table. If that slot is empty, great! You can put the item there. But if that slot is full, you then move to the next slot in the table, etc. (If you hit the end of the table, just wrap back around to the beginning).
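
A minimal sketch of linear probing, under a few simplifying assumptions: the table has a fixed size, there is no deletion (which needs extra care in open addressing), and the class name `LinearProbingSet` is made up for this example.

```python
class LinearProbingSet:
    _EMPTY = object()   # sentinel marking a free slot

    def __init__(self, num_slots=8):
        self.slots = [self._EMPTY] * num_slots

    def add(self, item):
        n = len(self.slots)
        i = hash(item) % n               # the slot the item "wants"
        for _ in range(n):               # visit each slot at most once
            if self.slots[i] is self._EMPTY:
                self.slots[i] = item     # free slot: store the item here
                return
            if self.slots[i] == item:
                return                   # already present
            i = (i + 1) % n              # full: shift to the next slot, wrapping
        raise RuntimeError("table is full")

    def __contains__(self, item):
        n = len(self.slots)
        i = hash(item) % n
        for _ in range(n):
            if self.slots[i] is self._EMPTY:
                return False             # hit a gap: the item can't be further on
            if self.slots[i] == item:
                return True
            i = (i + 1) % n
        return False


t = LinearProbingSet(8)
for x in (0, 8, 16):    # in CPython these small ints all hash to slot 0
    t.add(x)
print(t.slots[:3])      # the three colliding items sit in adjacent slots
```

A lookup can stop at the first empty slot because an insertion would have used that slot rather than walking past it.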

Linear probing is a surprisingly fast way to build a hash table. CPU caches are optimized for locality of reference, so memory lookups in adjacent memory locations tend to be much faster than memory lookups in scattered locations. Since a linear probing insertion or deletion works by hitting some array slot and then walking linearly forward, it results in few cache misses and ends up being a lot faster than what the theory normally predicts. (And it happens to be the case that the theory predicts it's going to be very fast!)

Another strategy that's become popular recently is cuckoo hashing. I like to think of cuckoo hashing as the "Frozen" of hash tables. Instead of having one hash table and one hash function, we have two hash tables and two hash functions. Each item can be in exactly one of two places - it's either in the location in the first table given by the first hash function, or it's in the location in the second table given by the second hash function. This means that lookups are worst-case efficient, since you only have to check two spots to see if something is in the table.

Insertions in cuckoo hashing use a different strategy than before. We start off by seeing if either of the two slots that could hold the item are free. If so, great! We just put the item there. But if that doesn't work, then we pick one of the slots, put the item there, and kick out the item that used to be there. That item has to go somewhere, so we try putting it in the other table at the appropriate slot. If that works, great! If not, we kick an item out of that table and try inserting it into the other table. This process continues until everything comes to rest, or we find ourselves trapped in a cycle. (That latter case is rare, and if it happens we have a bunch of options, like "put it in a secondary hash table" or "choose new hash functions and rebuild the tables.")
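
A rough sketch of the cuckoo insertion loop, with several assumptions baked in: the two hash functions are faked by hashing a tuple with Python's `hash()`, `MAX_KICKS` is an arbitrary cutoff for "we are probably in a cycle," the rebuild step doubles the tables in addition to picking new hash functions, and `None` marks an empty slot (so `None` itself can't be stored).

```python
class CuckooHashSet:
    MAX_KICKS = 32   # assumed cutoff before giving up and rebuilding

    def __init__(self, slots_per_table=8):
        self.m = slots_per_table
        self.seed = 0
        self.tables = [[None] * self.m, [None] * self.m]

    def _slot(self, which, item):
        # Stand-in for two independent hash functions (which = 0 or 1).
        return hash((self.seed, which, item)) % self.m

    def __contains__(self, item):
        # Worst case: exactly two slots to check.
        return any(self.tables[t][self._slot(t, item)] == item for t in (0, 1))

    def add(self, item):
        if item in self:
            return
        for t in (0, 1):                       # is either home slot free?
            i = self._slot(t, item)
            if self.tables[t][i] is None:
                self.tables[t][i] = item
                return
        t = 0
        for _ in range(self.MAX_KICKS):
            i = self._slot(t, item)
            # Put the item here and pick up whatever was there before.
            self.tables[t][i], item = item, self.tables[t][i]
            if item is None:
                return                         # everything came to rest
            t = 1 - t                          # evicted item tries the other table
        self._rebuild(item)

    def _rebuild(self, homeless):
        # Likely a cycle: choose new hash functions (new seed), grow, reinsert.
        items = [x for table in self.tables for x in table if x is not None]
        items.append(homeless)
        self.seed += 1
        self.m *= 2
        self.tables = [[None] * self.m, [None] * self.m]
        for x in items:
            self.add(x)
```

However the insertions play out, every stored item ends up in one of its two home slots, which is why `__contains__` never needs a loop over the table.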

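There are many improvements possible for cuckoo hashing, such as using multiple tables, letting each slot hold multiple items, and making a "stash" that holds items that can't fit anywhere else, and this is an active area of research!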

Then there are hybrid approaches. Hopscotch hashing is a mix between open addressing and chained hashing that can be thought of as taking a chained hash table and storing each item in each bucket in a slot near where the item wants to go. This strategy plays well with multithreading. The Swiss table uses the fact that some processors can perform multiple operations in parallel with a single instruction to speed up a linear probing table. Extendible hashing is designed for databases and file systems and uses a mix of a trie and a chained hash table to dynamically increase bucket sizes as individual buckets get loaded. Robin Hood hashing is a variant of linear probing in which items can be moved after being inserted to reduce the variance in how far from home each element can live.
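
To give a feel for one of these, here is a small sketch of the Robin Hood idea layered on linear probing: while inserting, if the item in hand is further from its home slot than the item currently in the slot, the two swap, and the displaced item keeps probing. The class and method names are made up, the table size is fixed, deletion is omitted, and `None` marks an empty slot.

```python
class RobinHoodSet:
    def __init__(self, num_slots=8):
        self.slots = [None] * num_slots
        self.size = 0

    def _distance(self, item, index):
        # How many steps `index` is past the item's home slot (with wraparound).
        n = len(self.slots)
        return (index - hash(item) % n) % n

    def __contains__(self, item):
        n = len(self.slots)
        i = hash(item) % n
        for dist in range(n):
            resident = self.slots[i]
            # An item closer to home than we are would have been displaced
            # by our item, so the item cannot be any further along.
            if resident is None or self._distance(resident, i) < dist:
                return False
            if resident == item:
                return True
            i = (i + 1) % n
        return False

    def add(self, item):
        if item in self:
            return
        n = len(self.slots)
        if self.size == n:
            raise RuntimeError("table is full")
        i, dist = hash(item) % n, 0
        while True:                     # terminates: at least one slot is empty
            resident = self.slots[i]
            if resident is None:
                self.slots[i] = item
                self.size += 1
                return
            resident_dist = self._distance(resident, i)
            if resident_dist < dist:    # resident is "richer": swap and carry it on
                self.slots[i], item, dist = item, resident, resident_dist
            i, dist = (i + 1) % n, dist + 1


r = RobinHoodSet(8)
for x in (0, 8, 1, 16):   # in CPython 0, 8, 16 want slot 0; 1 wants slot 1
    r.add(x)
print(r.slots[:4])        # [0, 8, 16, 1]
```

With plain linear probing the same insertions would leave 16 three slots from home; here 16 and 1 each end up two slots from home, which is the "reduce the variance" effect in miniature.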

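Further Reading

For more information about the basics of hash tables, check out these lecture slides on chained hashing and these follow-up slides on linear probing and Robin Hood hashing. You can learn more about cuckoo hashing here and about theoretical properties of hash functions here.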

2020-09-11 20:48:06

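All of the answers so far are good, and get at different aspects of how a hash table works. Here is a simple example that might be helpful. Let's say we want to store some items with lowercase alphabetic strings as keys.

As Simon has explained, the hash function is used to map from a large space to a small space. A simple, naive implementation of a hash function for our example could take the first letter of the string and map it to an integer, so "alligator" has a hash code of 0, "bee" has a hash code of 1, "zebra" would be 25, etc.

Next we have an array of 26 buckets (these could be ArrayLists in Java), and we put each item in the bucket that matches the hash code of its key. If we have more than one item whose key begins with the same letter, they will have the same hash code and so will all go in the same bucket, which means a linear search has to be made within that bucket to find a particular item.

In our example, if we just had a few dozen items with keys spanning the alphabet, it would work very well. However, if we had a million items, or all the keys started with 'a' or 'b', then our hash table would not be ideal. To get better performance, we would need a different hash function and/or more buckets.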
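
The first-letter scheme is small enough to write out in full. This is a sketch in Python rather than Java, and the helper names `first_letter_hash`, `put`, and `get` are made up for the example; keys are assumed to be non-empty lowercase strings.

```python
def first_letter_hash(key: str) -> int:
    """Naive hash: 'a...' -> 0, 'b...' -> 1, ..., 'z...' -> 25."""
    return ord(key[0]) - ord("a")

buckets = [[] for _ in range(26)]   # one bucket per letter

def put(key, value):
    buckets[first_letter_hash(key)].append((key, value))

def get(key):
    # Linear search, but only within the one bucket the hash code selects.
    for k, v in buckets[first_letter_hash(key)]:
        if k == key:
            return v
    return None

put("alligator", "reptile")
put("bee", "insect")
put("bear", "mammal")     # same first letter as "bee": same bucket
put("zebra", "mammal")

print(first_letter_hash("alligator"), first_letter_hash("bee"), first_letter_hash("zebra"))  # 0 1 25
print(get("bear"))        # mammal
```

With a million keys starting with 'a', every one of them would pile into `buckets[0]` and `get` would degrade to a plain linear search, which is the weakness described above.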

2009-04-08 16:41:10

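Direct Address Table

To understand a hash table, the direct address table is the first concept we should understand.

A direct address table uses the key directly as an index to a slot in an array. The size of the universe of keys is equal to the size of the array. Accessing a key takes O(1) time, which is really fast, because an array supports random access.

However, there are four considerations before implementing a direct address table:

1. To be a valid array index, the keys should be integers.
2. The range of the keys should be fairly small; otherwise, we will need a giant array.
3. No two different keys may be mapped to the same slot in the array.
4. The size of the universe of keys must equal the length of the array.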
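
When those four conditions hold, a direct address table takes only a few lines. A minimal sketch, with a made-up class name and `None` standing in for "no value stored":

```python
class DirectAddressTable:
    def __init__(self, universe_size):
        # One slot for every possible key: 0 .. universe_size - 1.
        self.slots = [None] * universe_size

    def _check(self, key):
        if not (isinstance(key, int) and 0 <= key < len(self.slots)):
            raise KeyError(key)   # keys must be small non-negative integers

    def insert(self, key, value):
        self._check(key)
        self.slots[key] = value   # the key itself is the array index

    def search(self, key):
        self._check(key)
        return self.slots[key]

    def delete(self, key):
        self._check(key)
        self.slots[key] = None


ages = DirectAddressTable(100)
ages.insert(42, "Douglas")
print(ages.search(42), ages.search(7))
```

Every operation is a single array access, but the array has to be as large as the universe of keys, even if only a handful of keys are ever used.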

In fact, not many situations in real life fit the above requirements, so a hash table comes to the rescue.

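Hash Table

Instead of using the keys directly, a hash table first applies a mathematical hash function to consistently convert any arbitrary key data to a number, then uses that hash result as the key.

The size of the universe of keys can be larger than the length of the array, which means that two different keys can hash to the same index (this is called a hash collision).

Actually, there are a few different strategies to deal with collisions. Here is a common solution: instead of storing the actual values in the array, we store a pointer to a linked list holding the values for all the keys that hash to that index.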
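
A minimal sketch of that linked-list approach, storing key/value pairs. The names `Node` and `ChainedHashMap` are made up for this example, Python's built-in `hash()` plays the role of the hash function, and the array size is fixed.

```python
class Node:
    """One entry in a slot's linked list."""
    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


class ChainedHashMap:
    def __init__(self, num_slots=8):
        self.slots = [None] * num_slots     # each slot: head of a linked list

    def _index(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        i = self._index(key)
        node = self.slots[i]
        while node is not None:             # key already present? overwrite it
            if node.key == key:
                node.value = value
                return
            node = node.next
        # New key: prepend a node to the list hanging off this slot.
        self.slots[i] = Node(key, value, self.slots[i])

    def get(self, key, default=None):
        node = self.slots[self._index(key)]
        while node is not None:             # walk the chain for this slot
            if node.key == key:
                return node.value
            node = node.next
        return default


m = ChainedHashMap(4)
m.put(1, "one")
m.put(5, "five")    # in CPython 1 and 5 both land in slot 1 of a 4-slot table
print(m.get(1), m.get(5), m.get(9))
```

Colliding keys simply share a chain, so a collision costs a slightly longer walk rather than a lost value.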

If you are still interested in how to implement a hash map from scratch, please read the following post.

2021-04-10 07:41:30
