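How does a hash table work?

I'm looking for an explanation of how a hash table works, in plain English for a simpleton like me!

For example, I know it takes the key, calculates the hash (this is the part I'm looking for an explanation of), and then performs some kind of modulo to work out where it lies in the array where the value is stored, but that's where my knowledge stops.

Could anyone clarify the process?

Edit: I'm not asking specifically about how hash codes are calculated, but for a general overview of how a hash table works.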

Current answer

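You take a bunch of things, and an array.

For each thing, you make up an index for it, called a hash. The important thing about the hash is that it scatters widely: you don't want two similar things to have similar hashes.

You put each thing into the array at the position its hash indicates. More than one thing can wind up at a given hash, so each position stores its things in an array or something else appropriate, which we generally call a bucket.

When you look something up in the hash table, you go through the same steps: figure out the hash value, see what's in the bucket at that location, and check whether it's what you're looking for.

When your hashing is working well and your array is big enough, there will be at most a few things at any particular index in the array, so you won't have to look at very much.

For bonus points, make it so that when your hash table is accessed, it moves the thing found (if any) to the beginning of the bucket, so it's the first thing checked next time.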
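
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production design: the class name `MoveToFrontTable` is made up, and Python's built-in `hash()` stands in for "the hash".

```python
class MoveToFrontTable:
    """Toy hash table: an array of buckets, each bucket a list of (key, value)."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        # The hash picks the position; modulo keeps it inside the array.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        bucket = self._bucket(key)
        for i, (k, v) in enumerate(bucket):
            if k == key:
                # Bonus step: move the hit to the front of its bucket.
                bucket.insert(0, bucket.pop(i))
                return v
        return default
```

Creating the table with `size=1` forces every key into the same bucket, which makes the move-to-front behaviour easy to watch.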

2009-04-08 16:22:54

Other answers

A Hashtable internally contains buckets in which it stores the key/value pairs. The Hashtable uses the key's hash code to decide which bucket a key/value pair should map to. The function that derives the bucket location from the key's hash code is known as the hash function. In principle, a hash function is a function which, when given a key, produces an address in the table. A hash function always returns a number for an object. Two equal objects will always have the same number, while two unequal objects may not always have different numbers. When we put objects into a hashtable, it is possible that different objects have the same hash code. This is known as a collision. To resolve collisions, the hashtable uses an array of lists. The pairs mapped to a single array index are stored in a list, and a reference to that list is stored at the index.

2021-11-30 15:01:54

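How the hash is computed usually does not depend on the hash table, but on the items added to it. In frameworks and base class libraries such as .NET and Java, each object has a GetHashCode() (or similar) method that returns a hash code for that object. The ideal hash code algorithm and its exact implementation depend on the data represented in the object.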
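
To make this concrete, here is a small sketch in Python, where the counterpart of such a method is `__hash__`. The `Point` class is hypothetical and exists only for illustration; the point is that the object, not the table, decides how it is hashed, and that the hash is derived from the same fields that define equality.

```python
class Point:
    """Hypothetical value type used only for illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Derive the hash code from the same fields that define equality,
        # so equal points always produce equal hash codes.
        return hash((self.x, self.y))


visited = {Point(1, 2): "start"}
found = visited[Point(1, 2)]  # an equal but distinct object finds the same entry
```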

2009-04-08 15:52:27

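Usage and lingo:

Hash tables are used to quickly store and retrieve data (or records). Records are stored in buckets using hash keys. Hash keys are calculated by applying a hashing algorithm to a chosen value (the key value) contained within the record. The chosen value must be one that every record has. Each bucket can hold multiple records, which are organized in a particular order.

Real-world example:

Hash & Co., founded in 1803 and lacking any computer technology, had a total of 300 filing cabinets to keep the detailed information (the records) for their approximately 30,000 clients. Each file folder was clearly labeled with its client number, a unique number from 0 to 29,999.

The filing clerks of that time had to quickly fetch and store client records for the working staff. The staff decided that it would be more efficient to use a hashing methodology to store and retrieve their records.

To file a client record, filing clerks would use the unique client number written on the folder. They would take this client number (the hash key) modulo 300 to identify the filing cabinet the folder belongs in. When they opened the filing cabinet, they would find that it contained many folders ordered by client number. After identifying the correct location, they would simply slip the folder in.

To retrieve a client record, filing clerks would be given a client number on a slip of paper. They would take this unique client number (the hash key) modulo 300 to determine which filing cabinet held the client's folder. When they opened the filing cabinet, they would find that it contained many folders ordered by client number. Searching through the records, they would quickly find the client folder and retrieve it.

In our real-world example, the buckets are the filing cabinets and the records are the file folders.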
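
The filing procedure can be sketched as follows. This is only an illustration under a few assumptions: each folder is represented by just its client number, and the names `cabinets`, `cabinet_for`, `file_record` and `has_record` are made up for the example.

```python
import bisect

NUM_CABINETS = 300  # from the example above

# One sorted list of client numbers per cabinet (the buckets).
cabinets = [[] for _ in range(NUM_CABINETS)]


def cabinet_for(client_number):
    # The client number is the hash key; modulo 300 picks the cabinet.
    return client_number % NUM_CABINETS


def file_record(client_number):
    # Slip the folder into its sorted position inside the cabinet.
    bisect.insort(cabinets[cabinet_for(client_number)], client_number)


def has_record(client_number):
    # Open one cabinet and binary-search its sorted folders.
    folders = cabinets[cabinet_for(client_number)]
    i = bisect.bisect_left(folders, client_number)
    return i < len(folders) and folders[i] == client_number


for number in range(30_000):
    file_record(number)
```

With client numbers 0 to 29,999, client 12,345 lands in cabinet 45, and every cabinet ends up holding exactly 100 folders.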

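An important thing to remember is that computers (and their algorithms) deal with numbers better than with strings. So accessing a large array by index is significantly faster than scanning it sequentially.

As Simon has mentioned, and I believe this is very important, the hashing part transforms a large space (of arbitrary length, usually strings and the like) and maps it to a small space (of known size, usually numbers) for indexing. This is very important to remember!

So in the example above, the roughly 30,000 possible clients are mapped into a smaller space.

The main idea is to divide your entire data set into segments so as to speed up the actual searching, which is usually the time-consuming part. In our example above, each of the 300 filing cabinets would (statistically) contain about 100 records. Searching through 100 records (regardless of their order) is much faster than having to deal with 30,000.

You may have noticed that some people already do this. But instead of devising a hashing methodology to generate a hash key, in most cases they simply use the first letter of the last name. So if you have 26 filing cabinets, one for each letter from A to Z, you have in theory segmented your data and improved the filing and retrieval process.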

2009-04-08 17:20:00

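Here's another way to look at it.

I assume you understand the concept of an array A: something that supports indexing, where you can get to the Ith element, A[I], in one step, no matter how large A is.

So, for example, if you want to store information about a group of people who all happen to have different ages, a simple way would be to have an array that is large enough and use each person's age as an index into the array. That way, you have one-step access to any person's information.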

But of course there could be more than one person with the same age, so what you put in the array at each entry is a list of all the people who have that age. So you can get to an individual person's information in one step plus a little bit of search in that list (called a "bucket"). It only slows down if there are so many people that the buckets get big. Then you need a larger array, and some other way to get more identifying information about the person, like the first few letters of their surname, instead of using age.

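That's the basic idea. Instead of using age, you can use any function of the person that produces a good spread of values. That's the hash function. For example, you could take every third bit of the ASCII representation of the person's name, scrambled in some order. All that matters is that you don't want too many people to hash to the same bucket, because the speed depends on the buckets remaining small.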
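
As a rough sketch of the same idea with a name-based hash instead of age: the function `name_hash` below is a toy chosen only for illustration (it is not the bit-picking scheme mentioned above), and `PeopleTable` is a made-up name.

```python
def name_hash(name):
    # Toy hash, assumed for illustration: mix the character codes of the name.
    h = 0
    for ch in name:
        h = (h * 31 + ord(ch)) % 1_000_003
    return h


class PeopleTable:
    def __init__(self, size=101):
        # Each array entry is a bucket: a list of (name, info) pairs.
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, name):
        return self.buckets[name_hash(name) % len(self.buckets)]

    def add(self, name, info):
        self._bucket(name).append((name, info))

    def find(self, name):
        # One step to reach the bucket, then a short search inside it.
        for stored_name, info in self._bucket(name):
            if stored_name == name:
                return info
        return None
```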

2009-04-08 17:44:33

The Basic Idea

Why do people use dressers to store their clothing? Besides looking trendy and stylish, they have the advantage that every article of clothing has a place where it's supposed to be. If you're looking for a pair of socks, you just check the sock drawer. If you're looking for a shirt, you check the drawer that has your shirts in it. It doesn't matter, when you're looking for socks, how many shirts you have or how many pairs of pants you own, since you don't need to look at them. You just look in the sock drawer and expect to find socks there.

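At a high level, a hash table is a way of storing things that is a bit like a dresser for clothing. The basic idea is the following:

You get some number of locations (drawers) where items can be stored. You come up with some rule that tells you which location (drawer) each item belongs in. When you need to find something, you use that rule to determine which drawer to look in.

The advantage of a system like this is that, assuming your rule isn't too complicated and you have an appropriate number of drawers, you can find what you're looking for pretty quickly by simply looking in the right place.

When you're putting your clothes away, the "rule" you use might be something like "socks go in the top-left drawer, shirts go in the large middle drawer," and so on. When you're storing more abstract data, though, we use something called a hash function to do this for us.

A reasonable way to think about a hash function is as a black box. You put data into one side, and a number called the hash code comes out of the other. Schematically, it looks something like this: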

              +---------+
            |\|   hash  |/| --> hash code
   data --> |/| function|\|
              +---------+

All hash functions are deterministic: if you put the same data into the function multiple times, you'll always get the same value coming out the other side. And a good hash function should look more or less random: small changes to the input data should give wildly different hash codes. For example, the hash codes for the string "pudu" and for the string "kudu" will likely be wildly different from one another. (Then again, it's possible that they're the same. After all, if a hash function's outputs should look more or less random, there's a chance we get the same hash code twice.)

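How do we build a hash function? For now, let's go with "decent people shouldn't think too much about that." Mathematicians have worked out better and worse ways to design hash functions, but for our purposes we don't really need to worry too much about the internals. It's plenty good to just think of a hash function as a function that's

deterministic (equal inputs give equal outputs), but looks random (it's hard to predict one hash code given another).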
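
For a concrete feel of those two properties, here is a sketch of 32-bit FNV-1a, one well-known simple hash function. It is picked here purely as an illustration and is not necessarily what any particular hash table uses.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash of a byte string."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h


# Same input, same output; a one-letter change gives a different-looking code.
for word in ("pudu", "kudu", "pudu"):
    print(word, hex(fnv1a_32(word.encode("ascii"))))
```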

Once we have a hash function, we can build a very simple hash table. We'll make an array of "buckets," which you can think of as being analogous to drawers in our dresser. To store an item in the hash table, we'll compute the hash code of the object and use it as an index in the table, which is analogous to "pick which drawer this item goes in." Then, we put that data item inside the bucket at that index. If that bucket was empty, great! We can put the item there. If that bucket is full, we have some choices of what we can do. A simple approach (called chained hashing) is to treat each bucket as a list of items, the same way that your sock drawer might store multiple socks, and then just add the item to the list at that index.

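To look something up in a hash table, we use basically the same procedure. We begin by computing the hash code for the item to look up, which tells us which bucket (drawer) to look in. If the item is in the table, it has to be in that bucket. Then, we just look at all the items in the bucket and see if our item is in there.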

What's the advantage of doing things this way? Well, assuming we have a large number of buckets, we'd expect that most buckets won't have too many things in them. After all, our hash function kinda sorta ish looks like it has random outputs, so the items are distributed kinda sorta ish evenly across all the buckets. In fact, if we formalize the notion of "our hash function looks kinda random," we can prove that the expected number of items in each bucket is the ratio of the total number of items to the total number of buckets. Therefore, we can find the items we're looking for without having to do too much work.
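
A quick experiment along these lines, as a sketch only: the keys are made up, and Python's built-in `hash()` plays the role of the hash function. Note that Python randomizes string hashes per process by default, so the largest bucket size varies from run to run, while the average is always items divided by buckets.

```python
from collections import Counter

NUM_BUCKETS = 16
items = [f"item-{i}" for i in range(1000)]  # made-up keys for illustration

# Count how many items land in each bucket.
sizes = Counter(hash(item) % NUM_BUCKETS for item in items)
bucket_sizes = [sizes[b] for b in range(NUM_BUCKETS)]

average = sum(bucket_sizes) / NUM_BUCKETS  # always n / m = 62.5 here
largest = max(bucket_sizes)                # how far the fullest bucket strays
print(average, largest)
```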

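The Details

Explaining "how a hash table works" is a bit tricky because there are many flavors of hash tables. This next section talks about a few general implementation details common to all hash tables, plus some specifics of how different styles of hash tables work.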

A first question that comes up is how you turn a hash code into a table slot index. In the above discussion, I just said "use the hash code as an index," but that's actually not a very good idea. In most programming languages, hash codes work out to 32-bit or 64-bit integers, and you aren't going to be able to use those directly as bucket indices. Instead, a common strategy is to make an array of buckets of some size m, compute the (full 32- or 64-bit) hash codes for your items, then mod them by the size of the table to get an index between 0 and m-1, inclusive. The use of modulus works well here because it's decently fast and does a decent job spreading the full range of hash codes across a smaller range.

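(Bitwise operators are sometimes used here. If your table has a size that's a power of two, say 2^k, then computing the bitwise AND of the hash code with the number 2^k - 1 is equivalent to computing a modulus, and it's significantly faster.)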
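
The equivalence is easy to check directly. The sketch below uses non-negative hash codes and two made-up helper names, `slot_by_mod` and `slot_by_mask`.

```python
def slot_by_mod(hash_code, table_size):
    return hash_code % table_size


def slot_by_mask(hash_code, table_size):
    # Only valid when table_size is a power of two.
    return hash_code & (table_size - 1)


k = 4
m = 1 << k  # table size 2^k = 16

for hash_code in (0, 5, 16, 12345, 2**32 - 1, 2**63 + 7):
    assert slot_by_mod(hash_code, m) == slot_by_mask(hash_code, m)
```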

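The next question is how to pick the right number of buckets. If you pick too many buckets, then most buckets will be empty or have few elements (good for speed, since you only have to check a few items per bucket), but you'll use a bunch of space simply storing the buckets (not so great, though maybe you can afford it). The flip side holds as well: if you have too few buckets, then you'll have more elements per bucket on average, making lookups take longer, but you'll use less memory.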

A good compromise is to dynamically change the number of buckets over the lifetime of the hash table. The load factor of a hash table, typically denoted α, is the ratio of the number of elements to the number of buckets. Most hash tables pick some maximum load factor. Once the load factor crosses this limit, the hash table increases its number of slots (say, by doubling), then redistributes the elements from the old table into the new one. This is called rehashing. Assuming the maximum load factor in the table is a constant, this ensures that, assuming you have a good hash function, the expected cost of doing a lookup remains O(1). Insertions now have an amortized expected cost of O(1) because of the cost of periodically rebuilding the table, as is the case with deletions. (Deletions can similarly compact the table if the load factor gets too small.)
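
Here is a minimal sketch of rehashing on top of a chained table. The maximum load factor of 0.75, the starting size of 4 and the class name `ResizingTable` are assumptions chosen for illustration; real implementations pick their own values, and shrinking on deletion is left out.

```python
class ResizingTable:
    MAX_LOAD = 0.75  # assumed threshold; real tables choose their own

    def __init__(self):
        self.buckets = [[] for _ in range(4)]
        self.count = 0

    def _index(self, key, num_buckets):
        return hash(key) % num_buckets

    def put(self, key, value):
        bucket = self.buckets[self._index(key, len(self.buckets))]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        # Load factor = elements / buckets; double the table when it gets too high.
        if self.count / len(self.buckets) > self.MAX_LOAD:
            self._rehash(2 * len(self.buckets))

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key, len(self.buckets))]:
            if k == key:
                return v
        return default

    def _rehash(self, new_size):
        # Every element is placed again because its index depends on the table size.
        new_buckets = [[] for _ in range(new_size)]
        for bucket in self.buckets:
            for k, v in bucket:
                new_buckets[self._index(k, new_size)].append((k, v))
        self.buckets = new_buckets
```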

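Hashing Strategies

So far, we've been talking about chained hashing, which is one of many different strategies for building a hash table. As a reminder, chained hashing is a bit like a clothing dresser: each bucket (drawer) can hold multiple items, and when you do a lookup you check all of those items.

However, this isn't the only way to build a hash table. There's another family of hash tables that use a strategy called open addressing. The basic idea behind open addressing is to store an array of slots, where each slot can either be empty or hold exactly one item.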

In open addressing, when you perform an insertion, as before, you jump to some slot whose index depends on the hash code computed. If that slot is free, great! You put the item there, and you're done. But what if the slot is already full? In that case, you use some secondary strategy to find a different free slot in which to store the item. The most common strategy for doing this uses an approach called linear probing. In linear probing, if the slot you want is already full, you simply shift to the next slot in the table. If that slot is empty, great! You can put the item there. But if that slot is full, you then move to the next slot in the table, etc. (If you hit the end of the table, just wrap back around to the beginning).

Linear probing is a surprisingly fast way to build a hash table. CPU caches are optimized for locality of reference, so memory lookups in adjacent memory locations tend to be much faster than memory lookups in scattered locations. Since a linear probing insertion or deletion works by hitting some array slot and then walking linearly forward, it results in few cache misses and ends up being a lot faster than what the theory normally predicts. (And it happens to be the case that the theory predicts it's going to be very fast!)
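
A bare-bones sketch of linear probing follows. It assumes a fixed capacity and leaves out deletion and resizing, both of which a real table needs; the class name `LinearProbingTable` is made up.

```python
class LinearProbingTable:
    _EMPTY = object()  # sentinel marking a free slot

    def __init__(self, capacity=8):
        self.slots = [self._EMPTY] * capacity

    def _probe(self, key):
        # Yield slot indices starting at the home slot, wrapping around the end.
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def put(self, key, value):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is self._EMPTY or slot[0] == key:
                self.slots[i] = (key, value)  # claim a free slot or overwrite
                return
        raise RuntimeError("table is full; a real table would resize first")

    def get(self, key, default=None):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is self._EMPTY:
                return default  # an empty slot ends the probe sequence
            if slot[0] == key:
                return slot[1]
        return default
```

In CPython small non-negative integers hash to themselves, so with a capacity of 8 the keys 0, 8 and 16 all want slot 0 and end up in slots 0, 1 and 2.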

Another strategy that's become popular recently is cuckoo hashing. I like to think of cuckoo hashing as the "Frozen" of hash tables. Instead of having one hash table and one hash function, we have two hash tables and two hash functions. Each item can be in exactly one of two places - it's either in the location in the first table given by the first hash function, or it's in the location in the second table given by the second hash function. This means that lookups are worst-case efficient, since you only have to check two spots to see if something is in the table.

Insertions in cuckoo hashing use a different strategy than before. We start off by seeing if either of the two slots that could hold the item are free. If so, great! We just put the item there. But if that doesn't work, then we pick one of the slots, put the item there, and kick out the item that used to be there. That item has to go somewhere, so we try putting it in the other table at the appropriate slot. If that works, great! If not, we kick an item out of that table and try inserting it into the other table. This process continues until everything comes to rest, or we find ourselves trapped in a cycle. (That latter case is rare, and if it happens we have a bunch of options, like "put it in a secondary hash table" or "choose new hash functions and rebuild the tables.")
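
The insertion procedure can be sketched like this. Several details are assumptions made for the sake of a short example: the two hash functions are derived by salting Python's `hash()`, the eviction limit `MAX_KICKS` is arbitrary, and an item that is still homeless after that many evictions is parked in a small overflow list rather than triggering a rebuild.

```python
class CuckooTable:
    MAX_KICKS = 32  # assumed cutoff for detecting a likely cycle

    def __init__(self, capacity=11):
        self.capacity = capacity
        self.tables = [[None] * capacity, [None] * capacity]
        self.overflow = []  # fallback for items that could not be placed

    def _slot(self, which, key):
        # Two hypothetical hash functions, derived by salting Python's hash().
        return hash((which, key)) % self.capacity

    def _locate(self, key):
        # A key lives in one of its two slots, or (rarely) in the overflow list.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                return self.tables[which], i
        for i, entry in enumerate(self.overflow):
            if entry[0] == key:
                return self.overflow, i
        return None

    def get(self, key, default=None):
        found = self._locate(key)
        return found[0][found[1]][1] if found else default

    def put(self, key, value):
        found = self._locate(key)
        if found:
            found[0][found[1]] = (key, value)  # update in place
            return
        for which in (0, 1):  # use a free candidate slot if there is one
            i = self._slot(which, key)
            if self.tables[which][i] is None:
                self.tables[which][i] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.MAX_KICKS):
            i = self._slot(which, entry[0])
            # Put the homeless entry here and pick up whatever was in the slot.
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return
            which = 1 - which  # the evicted entry tries the other table
        self.overflow.append(entry)
```

Lookups touch at most two slots plus the (normally empty) overflow list, which is what keeps them cheap in the worst case.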

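There are many improvements possible for cuckoo hashing, such as using multiple tables, letting each slot hold multiple items, and making a "stash" that holds items that can't fit anywhere else. This is an active area of research!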

Then there are hybrid approaches. Hopscotch hashing is a mix between open addressing and chained hashing that can be thought of as taking a chained hash table and storing each item in each bucket in a slot near where the item wants to go. This strategy plays well with multithreading. The Swiss table uses the fact that some processors can perform multiple operations in parallel with a single instruction to speed up a linear probing table. Extendible hashing is designed for databases and file systems and uses a mix of a trie and a chained hash table to dynamically increase bucket sizes as individual buckets get loaded. Robin Hood hashing is a variant of linear probing in which items can be moved after being inserted to reduce the variance in how far from home each element can live.

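Further Reading

For more information about the basics of hash tables, check out these lecture slides on chained hashing and these follow-up slides on linear probing and Robin Hood hashing. You can learn more about cuckoo hashing here and about theoretical properties of hash functions here.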

2020-09-11 20:48:06
