我正在寻找一个关于哈希表如何工作的解释-用简单的英语为我这样的傻瓜!

例如,我知道它接受键,计算哈希(我正在寻找一个解释),然后执行某种模运算来计算出它在存储值的数组中的位置,但我的知识到此为止。

谁能解释一下过程吗?

编辑:我并不是特别问哈希码是如何计算的,而是对哈希表如何工作的一般概述。


当前回答

基本思路

Why do people use dressers to store their clothing? Besides looking trendy and stylish, they have the advantage that every article of clothing has a place where it's supposed to be. If you're looking for a pair of socks, you just check the sock drawer. If you're looking for a shirt, you check the drawer that has your shirts in it. It doesn't matter, when you're looking for socks, how many shirts you have or how many pairs of pants you own, since you don't need to look at them. You just look in the sock drawer and expect to find socks there.

在高层次上,哈希表是一种存储东西的方式,有点像衣服的梳妆台。其基本思想如下:

你有一些可以存储物品的位置(抽屉)。 你想出一些规则,告诉你每件物品属于哪个位置(抽屉)。 当你需要找东西时,你就用这个规则来决定要找哪个抽屉。

这样的系统的优点是,假设您的规则不是太复杂,并且您有适当数量的抽屉,您可以通过查找正确的位置来快速找到您要找的东西。

当你把衣服放好时,你使用的“规则”可能是“袜子放在左边最上面的抽屉里,衬衫放在中间的大抽屉里,等等。”当你存储更抽象的数据时,我们使用一种叫做哈希函数的东西来为我们做这件事。

考虑哈希函数的一种合理方式是将其视为一个黑盒。你把数据放在一边,一个叫做哈希码的数字从另一边出来。从示意图上看,它是这样的:

              +---------+
            |\|   hash  |/| --> hash code
   data --> |/| function|\|
              +---------+

All hash functions are deterministic: if you put the same data into the function multiple times, you'll always get the same value coming out the other side. And a good hash function should look more or less random: small changes to the input data should give wildly different hash codes. For example, the hash codes for the string "pudu" and for the string "kudu" will likely be wildly different from one another. (Then again, it's possible that they're the same. After all, if a hash function's outputs should look more or less random, there's a chance we get the same hash code twice.)

如何构建哈希函数呢?现在,让我们选择“正派的人不应该想太多”。数学家们已经想出了更好和更差的方法来设计哈希函数,但对于我们的目的,我们真的不需要太担心内部。把哈希函数看成是这样的函数就很好了

确定性的(相同的输入给出相同的输出),但是 看起来是随机的(很难预测一个哈希码给出另一个)。

Once we have a hash function, we can build a very simple hash table. We'll make an array of "buckets," which you can think of as being analogous to drawers in our dresser. To store an item in the hash table, we'll compute the hash code of the object and use it as an index in the table, which is analogous to "pick which drawer this item goes in." Then, we put that data item inside the bucket at that index. If that bucket was empty, great! We can put the item there. If that bucket is full, we have some choices of what we can do. A simple approach (called chained hashing) is to treat each bucket as a list of items, the same way that your sock drawer might store multiple socks, and then just add the item to the list at that index.

要在哈希表中查找内容,我们基本上使用相同的过程。我们首先计算要查找的项的哈希代码,它告诉我们要查找哪个桶(抽屉)。如果条目在表中,它就必须在那个bucket中。然后,我们只需查看桶中的所有项,看看我们的项是否在其中。

What's the advantage of doing things this way? Well, assuming we have a large number of buckets, we'd expect that most buckets won't have too many things in them. After all, our hash function kinda sorta ish looks like it has random outputs, so the items are distributed kinda sorta ish evenly across all the buckets. In fact, if we formalize the notion of "our hash function looks kinda random," we can prove that the expected number of items in each bucket is the ratio of the total number of items to the total number of buckets. Therefore, we can find the items we're looking for without having to do too much work.

细节

解释“哈希表”是如何工作的有点棘手,因为哈希表有很多种。下一节将讨论所有哈希表通用的一些通用实现细节,以及不同风格的哈希表如何工作的一些细节。

A first question that comes up is how you turn a hash code into a table slot index. In the above discussion, I just said "use the hash code as an index," but that's actually not a very good idea. In most programming languages, hash codes work out to 32-bit or 64-bit integers, and you aren't going to be able to use those directly as bucket indices. Instead, a common strategy is to make an array of buckets of some size m, compute the (full 32- or 64-bit) hash codes for your items, then mod them by the size of the table to get an index between 0 and m-1, inclusive. The use of modulus works well here because it's decently fast and does a decent job spreading the full range of hash codes across a smaller range.

(这里有时会使用位运算符。如果你的表的大小是2的幂,比如说2k,那么计算哈希码的位与,然后数字2k - 1相当于计算一个模数,而且它明显更快。)

下一个问题是如何选择正确的桶数。如果您选择太多的bucket,那么大多数bucket将是空的或只有很少的元素(对速度有好处-每个bucket只需要检查一些项),但是您将使用大量的空间来简单地存储bucket(不是很好,尽管也许您可以负担得起)。反之亦然——如果存储桶太少,那么每个存储桶平均会有更多的元素,这会使查找时间变长,但会减少内存使用量。

A good compromise is to dynamically change the number of buckets over the lifetime of the hash table. The load factor of a hash table, typically denoted α, is the ratio of the number of elements to the number of buckets. Most hash tables pick some maximum load factor. Once the load factor crosses this limit, the hash table increases its number of slots (say, by doubling), then redistributes the elements from the old table into the new one. This is called rehashing. Assuming the maximum load factor in the table is a constant, this ensures that, assuming you have a good hash function, the expected cost of doing a lookup remains O(1). Insertions now have an amortized expected cost of O(1) because of the cost of periodically rebuilding the table, as is the case with deletions. (Deletions can similarly compact the table if the load factor gets too small.)

哈希策略

到目前为止,我们一直在讨论链式哈希,这是构建哈希表的许多不同策略之一。提醒一下,链式哈希有点像一个服装梳妆台——每个桶(抽屉)可以容纳多个项目,当你进行查找时,你会检查所有这些项目。

然而,这并不是构建哈希表的唯一方法。还有另一类哈希表使用一种叫做开放寻址的策略。开放寻址的基本思想是存储一个槽数组,其中每个槽可以是空的,也可以只保存一项。

In open addressing, when you perform an insertion, as before, you jump to some slot whose index depends on the hash code computed. If that slot is free, great! You put the item there, and you're done. But what if the slot is already full? In that case, you use some secondary strategy to find a different free slot in which to store the item. The most common strategy for doing this uses an approach called linear probing. In linear probing, if the slot you want is already full, you simply shift to the next slot in the table. If that slot is empty, great! You can put the item there. But if that slot is full, you then move to the next slot in the table, etc. (If you hit the end of the table, just wrap back around to the beginning).

Linear probing is a surprisingly fast way to build a hash table. CPU caches are optimized for locality of reference, so memory lookups in adjacent memory locations tend to be much faster than memory lookups in scattered locations. Since a linear probing insertion or deletion works by hitting some array slot and then walking linearly forward, it results in few cache misses and ends up being a lot faster than what the theory normally predicts. (And it happens to be the case that the theory predicts it's going to be very fast!)

Another strategy that's become popular recently is cuckoo hashing. I like to think of cuckoo hashing as the "Frozen" of hash tables. Instead of having one hash table and one hash function, we have two hash tables and two hash functions. Each item can be in exactly one of two places - it's either in the location in the first table given by the first hash function, or it's in the location in the second table given by the second hash function. This means that lookups are worst-case efficient, since you only have to check two spots to see if something is in the table.

Insertions in cuckoo hashing use a different strategy than before. We start off by seeing if either of the two slots that could hold the item are free. If so, great! We just put the item there. But if that doesn't work, then we pick one of the slots, put the item there, and kick out the item that used to be there. That item has to go somewhere, so we try putting it in the other table at the appropriate slot. If that works, great! If not, we kick an item out of that table and try inserting it into the other table. This process continues until everything comes to rest, or we find ourselves trapped in a cycle. (That latter case is rare, and if it happens we have a bunch of options, like "put it in a secondary hash table" or "choose new hash functions and rebuild the tables.")

对于布谷鸟哈希有许多改进的可能,例如使用多个表,让每个槽容纳多个项目,以及制作一个“隐藏”来保存其他地方无法容纳的项目,这是一个活跃的研究领域!

Then there are hybrid approaches. Hopscotch hashing is a mix between open addressing and chained hashing that can be thought of as taking a chained hash table and storing each item in each bucket in a slot near where the item wants to go. This strategy plays well with multithreading. The Swiss table uses the fact that some processors can perform multiple operations in parallel with a single instruction to speed up a linear probing table. Extendible hashing is designed for databases and file systems and uses a mix of a trie and a chained hash table to dynamically increase bucket sizes as individual buckets get loaded. Robin Hood hashing is a variant of linear probing in which items can be moved after being inserted to reduce the variance in how far from home each element can live.

进一步的阅读

有关哈希表基础知识的更多信息,请查看关于链式哈希的讲座幻灯片以及关于线性探测和罗宾汉哈希的后续幻灯片。你可以在这里学到更多关于布谷鸟哈希的知识,以及哈希函数的理论性质。

其他回答

基本思路

Why do people use dressers to store their clothing? Besides looking trendy and stylish, they have the advantage that every article of clothing has a place where it's supposed to be. If you're looking for a pair of socks, you just check the sock drawer. If you're looking for a shirt, you check the drawer that has your shirts in it. It doesn't matter, when you're looking for socks, how many shirts you have or how many pairs of pants you own, since you don't need to look at them. You just look in the sock drawer and expect to find socks there.

在高层次上,哈希表是一种存储东西的方式,有点像衣服的梳妆台。其基本思想如下:

你有一些可以存储物品的位置(抽屉)。 你想出一些规则,告诉你每件物品属于哪个位置(抽屉)。 当你需要找东西时,你就用这个规则来决定要找哪个抽屉。

这样的系统的优点是,假设您的规则不是太复杂,并且您有适当数量的抽屉,您可以通过查找正确的位置来快速找到您要找的东西。

当你把衣服放好时,你使用的“规则”可能是“袜子放在左边最上面的抽屉里,衬衫放在中间的大抽屉里,等等。”当你存储更抽象的数据时,我们使用一种叫做哈希函数的东西来为我们做这件事。

考虑哈希函数的一种合理方式是将其视为一个黑盒。你把数据放在一边,一个叫做哈希码的数字从另一边出来。从示意图上看,它是这样的:

              +---------+
            |\|   hash  |/| --> hash code
   data --> |/| function|\|
              +---------+

All hash functions are deterministic: if you put the same data into the function multiple times, you'll always get the same value coming out the other side. And a good hash function should look more or less random: small changes to the input data should give wildly different hash codes. For example, the hash codes for the string "pudu" and for the string "kudu" will likely be wildly different from one another. (Then again, it's possible that they're the same. After all, if a hash function's outputs should look more or less random, there's a chance we get the same hash code twice.)

如何构建哈希函数呢?现在,让我们选择“正派的人不应该想太多”。数学家们已经想出了更好和更差的方法来设计哈希函数,但对于我们的目的,我们真的不需要太担心内部。把哈希函数看成是这样的函数就很好了

确定性的(相同的输入给出相同的输出),但是 看起来是随机的(很难预测一个哈希码给出另一个)。

Once we have a hash function, we can build a very simple hash table. We'll make an array of "buckets," which you can think of as being analogous to drawers in our dresser. To store an item in the hash table, we'll compute the hash code of the object and use it as an index in the table, which is analogous to "pick which drawer this item goes in." Then, we put that data item inside the bucket at that index. If that bucket was empty, great! We can put the item there. If that bucket is full, we have some choices of what we can do. A simple approach (called chained hashing) is to treat each bucket as a list of items, the same way that your sock drawer might store multiple socks, and then just add the item to the list at that index.

要在哈希表中查找内容,我们基本上使用相同的过程。我们首先计算要查找的项的哈希代码,它告诉我们要查找哪个桶(抽屉)。如果条目在表中,它就必须在那个bucket中。然后,我们只需查看桶中的所有项,看看我们的项是否在其中。

What's the advantage of doing things this way? Well, assuming we have a large number of buckets, we'd expect that most buckets won't have too many things in them. After all, our hash function kinda sorta ish looks like it has random outputs, so the items are distributed kinda sorta ish evenly across all the buckets. In fact, if we formalize the notion of "our hash function looks kinda random," we can prove that the expected number of items in each bucket is the ratio of the total number of items to the total number of buckets. Therefore, we can find the items we're looking for without having to do too much work.

细节

解释“哈希表”是如何工作的有点棘手,因为哈希表有很多种。下一节将讨论所有哈希表通用的一些通用实现细节,以及不同风格的哈希表如何工作的一些细节。

A first question that comes up is how you turn a hash code into a table slot index. In the above discussion, I just said "use the hash code as an index," but that's actually not a very good idea. In most programming languages, hash codes work out to 32-bit or 64-bit integers, and you aren't going to be able to use those directly as bucket indices. Instead, a common strategy is to make an array of buckets of some size m, compute the (full 32- or 64-bit) hash codes for your items, then mod them by the size of the table to get an index between 0 and m-1, inclusive. The use of modulus works well here because it's decently fast and does a decent job spreading the full range of hash codes across a smaller range.

(这里有时会使用位运算符。如果你的表的大小是2的幂,比如说2k,那么计算哈希码的位与,然后数字2k - 1相当于计算一个模数,而且它明显更快。)

下一个问题是如何选择正确的桶数。如果您选择太多的bucket,那么大多数bucket将是空的或只有很少的元素(对速度有好处-每个bucket只需要检查一些项),但是您将使用大量的空间来简单地存储bucket(不是很好,尽管也许您可以负担得起)。反之亦然——如果存储桶太少,那么每个存储桶平均会有更多的元素,这会使查找时间变长,但会减少内存使用量。

A good compromise is to dynamically change the number of buckets over the lifetime of the hash table. The load factor of a hash table, typically denoted α, is the ratio of the number of elements to the number of buckets. Most hash tables pick some maximum load factor. Once the load factor crosses this limit, the hash table increases its number of slots (say, by doubling), then redistributes the elements from the old table into the new one. This is called rehashing. Assuming the maximum load factor in the table is a constant, this ensures that, assuming you have a good hash function, the expected cost of doing a lookup remains O(1). Insertions now have an amortized expected cost of O(1) because of the cost of periodically rebuilding the table, as is the case with deletions. (Deletions can similarly compact the table if the load factor gets too small.)

哈希策略

到目前为止,我们一直在讨论链式哈希,这是构建哈希表的许多不同策略之一。提醒一下,链式哈希有点像一个服装梳妆台——每个桶(抽屉)可以容纳多个项目,当你进行查找时,你会检查所有这些项目。

然而,这并不是构建哈希表的唯一方法。还有另一类哈希表使用一种叫做开放寻址的策略。开放寻址的基本思想是存储一个槽数组,其中每个槽可以是空的,也可以只保存一项。

In open addressing, when you perform an insertion, as before, you jump to some slot whose index depends on the hash code computed. If that slot is free, great! You put the item there, and you're done. But what if the slot is already full? In that case, you use some secondary strategy to find a different free slot in which to store the item. The most common strategy for doing this uses an approach called linear probing. In linear probing, if the slot you want is already full, you simply shift to the next slot in the table. If that slot is empty, great! You can put the item there. But if that slot is full, you then move to the next slot in the table, etc. (If you hit the end of the table, just wrap back around to the beginning).

Linear probing is a surprisingly fast way to build a hash table. CPU caches are optimized for locality of reference, so memory lookups in adjacent memory locations tend to be much faster than memory lookups in scattered locations. Since a linear probing insertion or deletion works by hitting some array slot and then walking linearly forward, it results in few cache misses and ends up being a lot faster than what the theory normally predicts. (And it happens to be the case that the theory predicts it's going to be very fast!)

Another strategy that's become popular recently is cuckoo hashing. I like to think of cuckoo hashing as the "Frozen" of hash tables. Instead of having one hash table and one hash function, we have two hash tables and two hash functions. Each item can be in exactly one of two places - it's either in the location in the first table given by the first hash function, or it's in the location in the second table given by the second hash function. This means that lookups are worst-case efficient, since you only have to check two spots to see if something is in the table.

Insertions in cuckoo hashing use a different strategy than before. We start off by seeing if either of the two slots that could hold the item are free. If so, great! We just put the item there. But if that doesn't work, then we pick one of the slots, put the item there, and kick out the item that used to be there. That item has to go somewhere, so we try putting it in the other table at the appropriate slot. If that works, great! If not, we kick an item out of that table and try inserting it into the other table. This process continues until everything comes to rest, or we find ourselves trapped in a cycle. (That latter case is rare, and if it happens we have a bunch of options, like "put it in a secondary hash table" or "choose new hash functions and rebuild the tables.")

对于布谷鸟哈希有许多改进的可能,例如使用多个表,让每个槽容纳多个项目,以及制作一个“隐藏”来保存其他地方无法容纳的项目,这是一个活跃的研究领域!

Then there are hybrid approaches. Hopscotch hashing is a mix between open addressing and chained hashing that can be thought of as taking a chained hash table and storing each item in each bucket in a slot near where the item wants to go. This strategy plays well with multithreading. The Swiss table uses the fact that some processors can perform multiple operations in parallel with a single instruction to speed up a linear probing table. Extendible hashing is designed for databases and file systems and uses a mix of a trie and a chained hash table to dynamically increase bucket sizes as individual buckets get loaded. Robin Hood hashing is a variant of linear probing in which items can be moved after being inserted to reduce the variance in how far from home each element can live.

进一步的阅读

有关哈希表基础知识的更多信息,请查看关于链式哈希的讲座幻灯片以及关于线性探测和罗宾汉哈希的后续幻灯片。你可以在这里学到更多关于布谷鸟哈希的知识,以及哈希函数的理论性质。

你们已经很接近完整地解释了这个问题,但是遗漏了一些东西。哈希表只是一个数组。数组本身将在每个槽中包含一些内容。至少要将哈希值或值本身存储在这个插槽中。除此之外,您还可以存储在此插槽上碰撞的值的链接/链表,或者您可以使用开放寻址方法。您还可以存储一个或多个指针,这些指针指向您希望从该槽中检索的其他数据。

It's important to note that the hashvalue itself generally does not indicate the slot into which to place the value. For example, a hashvalue might be a negative integer value. Obviously a negative number cannot point to an array location. Additionally, hash values will tend to many times be larger numbers than the slots available. Thus another calculation needs to be performed by the hashtable itself to figure out which slot the value should go into. This is done with a modulus math operation like:

uint slotIndex = hashValue % hashTableSize;

这个值是该值将要进入的槽。在开放寻址中,如果槽位已经被另一个哈希值和/或其他数据填充,将再次运行模运算来查找下一个槽:

slotIndex = (remainder + 1) % hashTableSize;

我想可能还有其他更高级的方法来确定槽索引,但这是我见过的最常见的方法……会对其他表现更好的公司感兴趣。

With the modulus method, if you have a table of say size 1000, any hashvalue that is between 1 and 1000 will go into the corresponding slot. Any Negative values, and any values greater than 1000 will be potentially colliding slot values. The chances of that happening depend both on your hashing method, as well as how many total items you add to the hash table. Generally, it's best practice to make the size of the hashtable such that the total number of values added to it is only equal to about 70% of its size. If your hash function does a good job of even distribution, you will generally encounter very few to no bucket/slot collisions and it will perform very quickly for both lookup and write operations. If the total number of values to add is not known in advance, make a good guesstimate using whatever means, and then resize your hashtable once the number of elements added to it reaches 70% of capacity.

我希望这对你有所帮助。

PS - In C# the GetHashCode() method is pretty slow and results in actual value collisions under a lot of conditions I've tested. For some real fun, build your own hashfunction and try to get it to NEVER collide on the specific data you are hashing, run faster than GetHashCode, and have a fairly even distribution. I've done this using long instead of int size hashcode values and it's worked quite well on up to 32 million entires hashvalues in the hashtable with 0 collisions. Unfortunately I can't share the code as it belongs to my employer... but I can reveal it is possible for certain data domains. When you can achieve this, the hashtable is VERY fast. :)

其实比这更简单。

哈希表不过是一个包含键/值对的向量数组(通常是稀疏数组)。此数组的最大大小通常小于哈希表中存储的数据类型的可能值集中的项数。

哈希算法用于根据将存储在数组中的项的值生成该数组的索引。

This is where storing vectors of key/value pairs in the array come in. Because the set of values that can be indexes in the array is typically smaller than the number of all possible values that the type can have, it is possible that your hash algorithm is going to generate the same value for two separate keys. A good hash algorithm will prevent this as much as possible (which is why it is relegated to the type usually because it has specific information which a general hash algorithm can't possibly know), but it's impossible to prevent.

因此,您可以使用多个键来生成相同的散列代码。当这种情况发生时,将遍历向量中的项,并在向量中的键和正在查找的键之间进行直接比较。如果找到,则返回与该键关联的值,否则不返回任何值。

Hashtable inside contains cans in which it stores the key sets. The Hashtable uses the hashcode to decide to which the key pair should plan. The capacity to get the container area from Key's hashcode is known as hash work. In principle, a hash work is a capacity which when given a key, creates an address in the table. A hash work consistently returns a number for an item. Two equivalent items will consistently have a similar number while two inconsistent objects may not generally have various numbers. When we put objects into a hashtable then it is conceivable that various objects may have equal/ same hashcode. This is known as a collision. To determine collision, hashtable utilizes a variety of lists. The sets mapped to a single array index are stored in a list and then the list reference is stored in the index.

用法和行话:

哈希表用于快速存储和检索数据(或记录)。 记录使用散列键存储在桶中 哈希键是通过对记录中包含的选定值(键值)应用哈希算法来计算的。所选值必须是所有记录的公共值。 每个桶可以有多条记录,这些记录按照特定的顺序组织。

现实世界的例子:

哈希公司成立于1803年,当时没有任何计算机技术,只有300个文件柜来保存大约3万名客户的详细信息(记录)。每个文件夹都清楚地标识其客户端编号,从0到29,999的唯一编号。

当时的档案管理员必须迅速为工作人员获取和存储客户记录。工作人员决定使用哈希方法来存储和检索他们的记录会更有效。

要归档客户记录,档案管理员将使用写在文件夹上的唯一客户编号。使用这个客户端编号,他们将哈希键调整300,以识别包含它的文件柜。当他们打开文件柜时,他们会发现里面有很多按客户号排序的文件夹。在确定正确的位置后,他们会简单地把它塞进去。

要检索客户记录,档案管理员将在一张纸上获得客户号码。使用这个唯一的客户端编号(哈希键),他们会将其调整300,以确定哪个文件柜拥有客户端文件夹。当他们打开文件柜时,他们会发现里面有很多按客户号排序的文件夹。通过搜索记录,他们可以快速找到客户端文件夹并检索它。

在我们的实际示例中,桶是文件柜,记录是文件夹。


需要记住的一件重要的事情是,计算机(及其算法)处理数字比处理字符串更好。因此,使用索引访问大型数组要比按顺序访问快得多。

正如Simon提到的,我认为非常重要的是哈希部分是转换一个大空间(任意长度,通常是字符串等),并将其映射到一个小空间(已知大小,通常是数字)进行索引。记住这一点非常重要!

因此,在上面的示例中,大约30,000个可能的客户机被映射到一个较小的空间中。


这样做的主要思想是将整个数据集划分为几个部分,以加快实际搜索的速度,而实际搜索通常是耗时的。在我们上面的例子中,300个文件柜中的每个(统计上)将包含大约100条记录。搜索100条记录(不管顺序)要比处理3万条记录快得多。

你可能已经注意到有些人已经这样做了。但是,在大多数情况下,他们只是使用姓氏的第一个字母,而不是设计一个哈希方法来生成哈希键。因此,如果您有26个文件柜,每个文件柜都包含从a到Z的一个字母,理论上您只是将数据分割并增强了归档和检索过程。