I'm looking for an explanation of how a hash table works - in plain English for a simpleton like me!

For example, I know it takes the key, calculates the hash of it (I'm looking for an explanation of how), and then performs some kind of modulo to work out where it lies in the array that the value is stored in, but that's where my knowledge stops.

Could someone clarify the process?

Edit: I'm not asking specifically about how hash codes are calculated, but for a general overview of how a hash table works.


Current answer

For all those looking for the programming parlance, here is how it works. The internal implementation of advanced hash tables has many intricacies and optimizations for storage allocation/deallocation and search, but the top-level idea remains very much the same.

(void) addValue : (object) value
{
   int bucket = calculate_bucket_from_val(value);   // the hash function picks the bucket
   if (bucket_exists(bucket))
   {
       // bucket already allocated - do nothing here, the value is simply overwritten below
   }
   else   // create the bucket first
   {
      create_extra_space_for_bucket();
   }
   put_value_into_bucket(bucket, value);
}

(bool) exists : (object) value
{
   int bucket = calculate_bucket_from_val(value);
   return bucket_exists(bucket);   // true only if something has been stored at that bucket
}

where calculate_bucket_from_val() is the hashing function, and all the uniqueness magic has to happen there.

The rule of thumb is: for a given value to be inserted, the bucket must be unique and derived from the value it is supposed to store.

A bucket is any space where the values are stored - here I have kept it as an int, used as an array index, but it could be a memory location as well.

Other answers

Here's an explanation in layman's terms.

Let's assume you want to fill up a library with books, and not just stuff them in there: you want to be able to easily find them again when you need them.

So you decide that if the person who wants to read a book knows the title of the book - the exact title, mind you - then that's all it should take. With the title, and with the aid of the librarian, the reader should be able to find the book easily and quickly.

So, how do you accomplish that? Well, obviously you can keep some kind of list of where you put each book, but then you have the same problem as searching the library: you'd need to search the list. Granted, the list would be smaller and easier to search, but you still don't want to search sequentially from one end of the library (or the list) to the other.

You want something that, given the title of the book, can give you the right spot at once, so all you have to do is stroll over to the right shelf and pick up the book.

But how can that be done? Well, with a bit of foresight when you fill up the library, and a lot of work while you do so.

Instead of just starting to fill up the library from one end to the other, you devise a clever little method. You take the title of the book, run it through a small computer program, and it spits out a shelf number and a slot number on that shelf. That is where you place the book.

The beauty of this program is that later on, when a person comes back to read the book, you feed the title through the program once more and get back the same shelf number and slot number you were originally given, and that is where the book is located.

The program, as others have already mentioned, is called a hashing algorithm or hash computation, and it usually works by taking the data fed into it (the title of the book in this case) and calculating a number from it.

For simplicity, let's say that it just converts each letter and symbol into a number and sums them all up. In reality it's a lot more complicated than that, but let's leave it at that for now.

The beauty of such an algorithm is that if you feed the same input into it again and again, it will keep spitting out the same number each time.
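As a rough sketch of that sum-the-letters idea in Python (the toy_hash name and the example title are made up purely for illustration, not part of any real hashing algorithm):

def toy_hash(title):
    # Toy hash: add up the character codes of the title.
    # Real hash functions are far more sophisticated than this.
    return sum(ord(ch) for ch in title)

print(toy_hash("Moby Dick"))   # some number...
print(toy_hash("Moby Dick"))   # ...and the exact same number again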

OK, so that's basically how a hash table works.

Some technical stuff follows.

First, there's the size of the number. Usually, the output of such a hashing algorithm is inside a range of some large number, typically much larger than the space you have in your table. For instance, let's say that we have room for exactly one million books in the library. The output of the hash calculation could be in the range of 0 to one billion, which is a lot higher.

So, what do we do? We use something called modulus calculation, which basically says that if you counted up to the number you wanted (i.e. the one-billion number) but wanted to stay inside a much smaller range, then each time you hit the limit of that smaller range you start back at 0, while keeping track of how far along the big sequence you've come.

Say that the output of the hashing algorithm is in the range of 0 to 20 and you get the value 17 for a particular title. If the size of the library is only 7 books, you count 1, 2, 3, 4, 5, 6, and when you get to 7, you start back at 0. Since we need to count 17 times, we get 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, and the final number is 3.

Of course, modulus calculation isn't really done like that; it's done with division and a remainder. The remainder of dividing 17 by 7 is 3 (7 goes into 17 twice, giving 14, and the difference between 17 and 14 is 3).

So, you put the book in slot number 3.
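Sketched in Python, using only the numbers from the example above:

hash_value = 17        # what the hashing algorithm produced for this title
library_size = 7       # how many slots our tiny library has
slot = hash_value % library_size
print(slot)            # 3 -- so the book goes in slot number 3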

This leads us to the next problem: collisions. Since the algorithm has no way of spacing out the books so that they fill the library (or the hash table, if you will) exactly, it will invariably end up calculating a number that has been used before. In the library sense, when you get to the shelf and the slot number you wish to put a book in, there's already a book there.

Various collision handling methods exist, including running the data through yet another calculation to get another spot in the table (double hashing), or simply finding a space close to the one you were given (i.e. right next to the previous book, assuming that slot is available, also known as linear probing). This means you have some digging to do when you try to find the book later, but it's still better than simply starting at one end of the library.

Finally, at some point you might want to put more books into the library than the library allows. In other words, you need to build a bigger library. Since the exact spot in the library was calculated using the exact and current size of the library, it follows that if you resize the library, you might end up having to find new spots for all the books, since the calculation used to find their spots has changed.

I hope this explanation was a bit more down-to-earth than buckets and functions :)

The Basic Idea

Why do people use dressers to store their clothing? Besides looking trendy and stylish, they have the advantage that every article of clothing has a place where it's supposed to be. If you're looking for a pair of socks, you just check the sock drawer. If you're looking for a shirt, you check the drawer that has your shirts in it. It doesn't matter, when you're looking for socks, how many shirts you have or how many pairs of pants you own, since you don't need to look at them. You just look in the sock drawer and expect to find socks there.

At a high level, a hash table is a way of storing things that is a bit like a dresser for clothing. The basic idea is the following:

You have some locations (drawers) where items can be stored.
You come up with some rule that tells you which location (drawer) each item belongs in.
When you need to find something, you use that rule to determine which drawer to look into.

The advantage of a system like this is that, assuming your rule isn't too complicated and you have an appropriate number of drawers, you can find what you're looking for pretty quickly simply by looking in the right place.

When you're putting your clothes away, the "rule" you use might be something like "socks go in the top left drawer, shirts go in the large middle drawer, etc." When you're storing more abstract data, we use something called a hash function to do this for us.

A reasonable way to think about a hash function is to treat it as a black box. You put data in one side, and a number called the hash code comes out the other. Schematically, it looks something like this:

            +-----------+
   data --> |   hash    | --> hash code
            | function  |
            +-----------+

All hash functions are deterministic: if you put the same data into the function multiple times, you'll always get the same value coming out the other side. And a good hash function should look more or less random: small changes to the input data should give wildly different hash codes. For example, the hash codes for the string "pudu" and for the string "kudu" will likely be wildly different from one another. (Then again, it's possible that they're the same. After all, if a hash function's outputs should look more or less random, there's a chance we get the same hash code twice.)

How do you build a hash function? For now, let's go with "decent people shouldn't think too much about that." Mathematicians have worked out better and worse ways to design hash functions, but for our purposes we really don't need to worry about the internals. It's good enough to just think of a hash function as a function that is

deterministic (equal inputs give equal outputs), but
looks random (it's hard to predict one hash code given another).

Once we have a hash function, we can build a very simple hash table. We'll make an array of "buckets," which you can think of as being analogous to drawers in our dresser. To store an item in the hash table, we'll compute the hash code of the object and use it as an index in the table, which is analogous to "pick which drawer this item goes in." Then, we put that data item inside the bucket at that index. If that bucket was empty, great! We can put the item there. If that bucket is full, we have some choices of what we can do. A simple approach (called chained hashing) is to treat each bucket as a list of items, the same way that your sock drawer might store multiple socks, and then just add the item to the list at that index.

To look something up in a hash table, we use basically the same procedure. We start by computing the hash code of the item to look up, which tells us which bucket (drawer) to look in. If the item is in the table, it has to be in that bucket. Then, we just look at all the items in that bucket and see whether our item is among them.
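Here is a minimal sketch of that bucket-array idea in Python, using chained hashing (the class and method names are made up for illustration and not taken from any particular library):

class ChainedHashTable:
    # Each bucket is a plain list of (key, value) pairs.
    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, key):
        # hash code -> bucket index ("which drawer does this go in")
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket_for(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # otherwise add it to this bucket's list

    def get(self, key):
        # only the one bucket the key hashes to needs to be searched
        for k, v in self._bucket_for(key):
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("socks", 12)
print(table.get("socks"))   # 12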

What's the advantage of doing things this way? Well, assuming we have a large number of buckets, we'd expect that most buckets won't have too many things in them. After all, our hash function kinda sorta ish looks like it has random outputs, so the items are distributed kinda sorta ish evenly across all the buckets. In fact, if we formalize the notion of "our hash function looks kinda random," we can prove that the expected number of items in each bucket is the ratio of the total number of items to the total number of buckets. Therefore, we can find the items we're looking for without having to do too much work.

The Details

It's a bit tricky to explain how "a hash table" works because there are many flavors of hash tables. This next section covers a few general implementation details common to all hash tables, plus some specifics of how different styles of hash tables work.

A first question that comes up is how you turn a hash code into a table slot index. In the above discussion, I just said "use the hash code as an index," but that's actually not a very good idea. In most programming languages, hash codes work out to 32-bit or 64-bit integers, and you aren't going to be able to use those directly as bucket indices. Instead, a common strategy is to make an array of buckets of some size m, compute the (full 32- or 64-bit) hash codes for your items, then mod them by the size of the table to get an index between 0 and m-1, inclusive. The use of modulus works well here because it's decently fast and does a decent job spreading the full range of hash codes across a smaller range.

(You will sometimes see bitwise operators used here. If the size of your table is a power of two, say 2^k, then computing the bitwise AND of the hash code and the number 2^k - 1 is equivalent to computing a modulus, and it's significantly faster.)
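For example, with a hypothetical table size of 2^4 = 16:

m = 16                       # table size, a power of two (2**4)
hash_code = 123456789
print(hash_code % m)         # 5, via modulus
print(hash_code & (m - 1))   # 5, via bitwise AND -- same index, typically faster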

The next question is how to choose the right number of buckets. If you pick too many buckets, then most buckets will be empty or hold only a few elements (good for speed - you only have to check a few items per bucket), but you'll use a lot of space simply storing the buckets (not so great, though maybe you can afford it). The flip side holds as well - if you have too few buckets, then each bucket will hold more elements on average, making lookups take longer, but you'll use less memory.

A good compromise is to dynamically change the number of buckets over the lifetime of the hash table. The load factor of a hash table, typically denoted α, is the ratio of the number of elements to the number of buckets. Most hash tables pick some maximum load factor. Once the load factor crosses this limit, the hash table increases its number of slots (say, by doubling), then redistributes the elements from the old table into the new one. This is called rehashing. Assuming the maximum load factor in the table is a constant, this ensures that, assuming you have a good hash function, the expected cost of doing a lookup remains O(1). Insertions now have an amortized expected cost of O(1) because of the cost of periodically rebuilding the table, as is the case with deletions. (Deletions can similarly compact the table if the load factor gets too small.)
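A sketch of that rehashing step, bolted onto the chained table sketched earlier (the 0.75 maximum load factor is just an assumed, commonly chosen value, not one prescribed above):

MAX_LOAD_FACTOR = 0.75   # assumed threshold

def put_with_rehash(table, key, value):
    table.put(key, value)
    num_items = sum(len(bucket) for bucket in table.buckets)
    if num_items / len(table.buckets) > MAX_LOAD_FACTOR:
        old_items = [pair for bucket in table.buckets for pair in bucket]
        table.buckets = [[] for _ in range(2 * len(table.buckets))]   # double the buckets
        for k, v in old_items:    # redistribute every element into the new, bigger table
            table.put(k, v)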

Hashing Strategies

Up to this point, we've been talking about chained hashing, which is one of many different strategies for building a hash table. As a reminder, chained hashing is kind of like a clothes dresser - each bucket (drawer) can hold multiple items, and when you do a lookup you check all of those items.

However, this isn't the only way to build a hash table. There's another family of hash tables that use a strategy called open addressing. The basic idea of open addressing is to store an array of slots, where each slot is either empty or holds exactly one item.

In open addressing, when you perform an insertion, as before, you jump to some slot whose index depends on the hash code computed. If that slot is free, great! You put the item there, and you're done. But what if the slot is already full? In that case, you use some secondary strategy to find a different free slot in which to store the item. The most common strategy for doing this uses an approach called linear probing. In linear probing, if the slot you want is already full, you simply shift to the next slot in the table. If that slot is empty, great! You can put the item there. But if that slot is full, you then move to the next slot in the table, etc. (If you hit the end of the table, just wrap back around to the beginning).
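A compact sketch of open addressing with linear probing in Python (deletion is left out, and the sketch assumes the table never completely fills up):

class LinearProbingTable:
    # One (key, value) pair per slot; walk forward to the next slot on a collision.
    def __init__(self, num_slots=16):
        self.slots = [None] * num_slots

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)   # slot taken: try the next one, wrapping around
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)   # keep walking until a match or an empty slot
        raise KeyError(key)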

Linear probing is a surprisingly fast way to build a hash table. CPU caches are optimized for locality of reference, so memory lookups in adjacent memory locations tend to be much faster than memory lookups in scattered locations. Since a linear probing insertion or deletion works by hitting some array slot and then walking linearly forward, it results in few cache misses and ends up being a lot faster than what the theory normally predicts. (And it happens to be the case that the theory predicts it's going to be very fast!)

Another strategy that's become popular recently is cuckoo hashing. I like to think of cuckoo hashing as the "Frozen" of hash tables. Instead of having one hash table and one hash function, we have two hash tables and two hash functions. Each item can be in exactly one of two places - it's either in the location in the first table given by the first hash function, or it's in the location in the second table given by the second hash function. This means that lookups are worst-case efficient, since you only have to check two spots to see if something is in the table.

Insertions in cuckoo hashing use a different strategy than before. We start off by seeing if either of the two slots that could hold the item are free. If so, great! We just put the item there. But if that doesn't work, then we pick one of the slots, put the item there, and kick out the item that used to be there. That item has to go somewhere, so we try putting it in the other table at the appropriate slot. If that works, great! If not, we kick an item out of that table and try inserting it into the other table. This process continues until everything comes to rest, or we find ourselves trapped in a cycle. (That latter case is rare, and if it happens we have a bunch of options, like "put it in a secondary hash table" or "choose new hash functions and rebuild the tables.")
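A rough sketch of that insertion dance in Python (two tables, two hash functions; updating an existing key and the "rebuild on a cycle" fallback are only hinted at, and the helper names are invented for illustration):

class CuckooTable:
    # Every key lives in exactly one of its two possible spots.
    def __init__(self, size=16):
        self.tables = [[None] * size, [None] * size]

    def _index(self, which, key):
        # two different hash functions, one per table (a simple trick for illustration)
        return hash((which, key)) % len(self.tables[which])

    def get(self, key):
        for which in (0, 1):                       # worst case: check just two spots
            slot = self.tables[which][self._index(which, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def put(self, key, value, max_kicks=32):
        item, which = (key, value), 0
        for _ in range(max_kicks):
            i = self._index(which, item[0])
            if self.tables[which][i] is None:      # free spot: done
                self.tables[which][i] = item
                return
            # evict the current occupant and try to re-home it in the other table
            self.tables[which][i], item = item, self.tables[which][i]
            which = 1 - which
        raise RuntimeError("probable cycle: pick new hash functions and rebuild")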

There are many improvements possible for cuckoo hashing, such as using multiple tables, letting each slot hold multiple items, and keeping a "stash" that holds items that can't fit anywhere else. This is an active area of research!

Then there are hybrid approaches. Hopscotch hashing is a mix between open addressing and chained hashing that can be thought of as taking a chained hash table and storing each item in each bucket in a slot near where the item wants to go. This strategy plays well with multithreading. The Swiss table uses the fact that some processors can perform multiple operations in parallel with a single instruction to speed up a linear probing table. Extendible hashing is designed for databases and file systems and uses a mix of a trie and a chained hash table to dynamically increase bucket sizes as individual buckets get loaded. Robin Hood hashing is a variant of linear probing in which items can be moved after being inserted to reduce the variance in how far from home each element can live.

Further Reading

For more information about the basics of hash tables, check out the lecture slides on chained hashing and the follow-up slides on linear probing and Robin Hood hashing. You can learn more about cuckoo hashing there, as well as about the theoretical properties of hash functions.

Direct Address Table

To understand hash tables, the direct address table is the first concept we should understand.

A direct address table uses the key directly as the index of a slot in an array. The size of the universe of keys equals the size of the array. Accessing a key is really fast, in O(1) time, because an array supports random-access operations.
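A minimal direct address table in Python, assuming the keys are small non-negative integers (the universe size of 10 is arbitrary):

class DirectAddressTable:
    # The key itself is the array index; one slot per possible key.
    def __init__(self, universe_size=10):
        self.slots = [None] * universe_size

    def put(self, key, value):
        self.slots[key] = value    # O(1): the key *is* the index

    def get(self, key):
        return self.slots[key]     # O(1) random access into the array

t = DirectAddressTable()
t.put(3, "three")
print(t.get(3))   # three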

However, there are four considerations before implementing a direct address table:

To be a valid array index, the key needs to be an integer
The range of keys must be fairly small; otherwise, we would need a huge array
No two different keys may map to the same slot in the array
The size of the universe of keys equals the size of the array

In reality, not many real-life situations fit the requirements above, so the hash table comes to the rescue.

Hash Table

Instead of using the key directly, a hash table first applies a mathematical hash function to consistently convert arbitrary key data into a number, and then uses that hash result as the key.

The size of the universe of keys can be greater than the size of the array, which means two different keys can hash to the same index (this is called a hash collision).

In practice, there are a few different strategies to deal with it. Here is a common solution: instead of storing the actual values in the array, we store a pointer to a linked list that holds the values of all the keys that hash to that index.

If you are still interested in knowing how to implement a hashmap from scratch, read the post below.

Short and sweet:

A hash table wraps up an array, which we'll call internalArray. Items are inserted into the array in the following manner:

let insert key value =
    internalArray[hash(key) % internalArray.Length] <- (key, value)
    //oversimplified for educational purposes

Sometimes two keys will hash to the same index in the array, and you want to keep both values. I like to store both values at the same index, which is simple to code if you make internalArray an array of linked lists:

let insert key value =
    internalArray[hash(key) % internalArray.Length].AddLast((key, value))

So, if I wanted to retrieve an item out of my hash table, I could write:

let get key =
    let linkedList = internalArray[hash(key) % internalArray.Length]
    // walk the bucket's list looking for a matching key; None if it isn't there
    linkedList
    |> Seq.tryPick (fun (testKey, value) -> if testKey = key then Some value else None)

Removing is just as simple to write. As you can tell, inserting, looking up, and removing from our array of linked lists is nearly O(1).

When our internalArray gets too full, maybe at around 85% capacity, we can resize the internal array and move all the items from the old array into the new array.