I'm looking for an explanation of how a hash table works - in plain English for a simpleton like me!

For example, I know it takes the key, calculates the hash of it (I'm looking for an explanation of how), and then performs some kind of modulo to work out where it lies in the array that the value is stored in, but that's where my knowledge stops.

Could someone clarify the process?

Edit: I'm not asking specifically about how hash codes are calculated, but for a general overview of how a hash table works.


Current answer

For all those looking for the programming parlance, here is how it works. The internal implementation of advanced hash tables has many intricacies and optimizations for storage allocation/deallocation and search, but the top-level idea remains very much the same.

(void) addValue : (object) value
{
   int bucket = calculate_bucket_from_val(value);   // the hash function picks the bucket
   if (bucket_exists(bucket))
   {
       // bucket already allocated - do nothing here, the value is simply overwritten below
   }
   else   // create the bucket first
   {
      create_extra_space_for_bucket();
   }
   put_value_into_bucket(bucket, value);
}

(bool) exists : (object) value
{
   int bucket = calculate_bucket_from_val(value);
   return bucket_exists(bucket);   // true only if something has been stored at that bucket
}

where calculate_bucket_from_val() is the hashing function, and all the uniqueness magic has to happen there.

The rule of thumb is: for a given value to be inserted, the bucket must be unique and derived from the value it is supposed to store.

A bucket is any space where the values are stored - here I have kept it as an int, used as an array index, but it could be a memory location as well.

Other answers

Here's an explanation in layman's terms.

Let's assume you want to fill up a library with books, and not just stuff them in there: you want to be able to easily find them again when you need them.

So you decide that if the person who wants to read a book knows the title of the book - the exact title, mind you - then that's all it should take. With the title, and with the aid of the librarian, the reader should be able to find the book easily and quickly.

So, how do you accomplish that? Well, obviously you can keep some kind of list of where you put each book, but then you have the same problem as searching the library: you'd need to search the list. Granted, the list would be smaller and easier to search, but you still don't want to search sequentially from one end of the library (or the list) to the other.

You want something that, given the title of the book, can give you the right spot at once, so all you have to do is stroll over to the right shelf and pick up the book.

But how can that be done? Well, with a bit of foresight when you fill up the library, and a lot of work while you do so.

Instead of just starting to fill up the library from one end to the other, you devise a clever little method. You take the title of the book, run it through a small computer program, and it spits out a shelf number and a slot number on that shelf. That is where you place the book.

The beauty of this program is that later on, when a person comes back to read the book, you feed the title through the program once more and get back the same shelf number and slot number you were originally given, and that is where the book is located.

The program, as others have already mentioned, is called a hashing algorithm or hash computation, and it usually works by taking the data fed into it (the title of the book in this case) and calculating a number from it.

For simplicity, let's say that it just converts each letter and symbol into a number and sums them all up. In reality it's a lot more complicated than that, but let's leave it at that for now.

The beauty of such an algorithm is that if you feed the same input into it again and again, it will keep spitting out the same number each time.
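As a rough sketch of that sum-the-letters idea in Python (the toy_hash name and the example title are made up purely for illustration, not part of any real hashing algorithm):

def toy_hash(title):
    # Toy hash: add up the character codes of the title.
    # Real hash functions are far more sophisticated than this.
    return sum(ord(ch) for ch in title)

print(toy_hash("Moby Dick"))   # some number...
print(toy_hash("Moby Dick"))   # ...and the exact same number again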

OK, so that's basically how a hash table works.

Some technical stuff follows.

First, there's the size of the number. Usually, the output of such a hashing algorithm is inside a range of some large number, typically much larger than the space you have in your table. For instance, let's say that we have room for exactly one million books in the library. The output of the hash calculation could be in the range of 0 to one billion, which is a lot higher.

So, what do we do? We use something called modulus calculation, which basically says that if you counted up to the number you wanted (i.e. the one-billion number) but wanted to stay inside a much smaller range, then each time you hit the limit of that smaller range you start back at 0, while keeping track of how far along the big sequence you've come.

Say that the output of the hashing algorithm is in the range of 0 to 20 and you get the value 17 for a particular title. If the size of the library is only 7 books, you count 1, 2, 3, 4, 5, 6, and when you get to 7, you start back at 0. Since we need to count 17 times, we get 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, and the final number is 3.

Of course, modulus calculation isn't really done like that; it's done with division and a remainder. The remainder of dividing 17 by 7 is 3 (7 goes into 17 twice, giving 14, and the difference between 17 and 14 is 3).

So, you put the book in slot number 3.
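Sketched in Python, using only the numbers from the example above:

hash_value = 17        # what the hashing algorithm produced for this title
library_size = 7       # how many slots our tiny library has
slot = hash_value % library_size
print(slot)            # 3 -- so the book goes in slot number 3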

This leads us to the next problem: collisions. Since the algorithm has no way of spacing out the books so that they fill the library (or the hash table, if you will) exactly, it will invariably end up calculating a number that has been used before. In the library sense, when you get to the shelf and the slot number you wish to put a book in, there's already a book there.

Various collision handling methods exist, including running the data through yet another calculation to get another spot in the table (double hashing), or simply finding a space close to the one you were given (i.e. right next to the previous book, assuming that slot is available, also known as linear probing). This means you have some digging to do when you try to find the book later, but it's still better than simply starting at one end of the library.

Finally, at some point you might want to put more books into the library than the library allows. In other words, you need to build a bigger library. Since the exact spot in the library was calculated using the exact and current size of the library, it follows that if you resize the library, you might end up having to find new spots for all the books, since the calculation used to find their spots has changed.

I hope this explanation was a bit more down-to-earth than buckets and functions :)

The Basic Idea

Why do people use dressers to store their clothing? Besides looking trendy and stylish, they have the advantage that every article of clothing has a place where it's supposed to be. If you're looking for a pair of socks, you just check the sock drawer. If you're looking for a shirt, you check the drawer that has your shirts in it. It doesn't matter, when you're looking for socks, how many shirts you have or how many pairs of pants you own, since you don't need to look at them. You just look in the sock drawer and expect to find socks there.

At a high level, a hash table is a way of storing things that is a bit like a dresser for clothing. The basic idea is the following:

You have some locations (drawers) where items can be stored.
You come up with some rule that tells you which location (drawer) each item belongs in.
When you need to find something, you use that rule to determine which drawer to look into.

The advantage of a system like this is that, assuming your rule isn't too complicated and you have an appropriate number of drawers, you can find what you're looking for pretty quickly simply by looking in the right place.

When you're putting your clothes away, the "rule" you use might be something like "socks go in the top left drawer, shirts go in the large middle drawer, etc." When you're storing more abstract data, we use something called a hash function to do this for us.

A reasonable way to think about a hash function is to treat it as a black box. You put data in one side, and a number called the hash code comes out the other. Schematically, it looks something like this:

            +-----------+
   data --> |   hash    | --> hash code
            | function  |
            +-----------+

All hash functions are deterministic: if you put the same data into the function multiple times, you'll always get the same value coming out the other side. And a good hash function should look more or less random: small changes to the input data should give wildly different hash codes. For example, the hash codes for the string "pudu" and for the string "kudu" will likely be wildly different from one another. (Then again, it's possible that they're the same. After all, if a hash function's outputs should look more or less random, there's a chance we get the same hash code twice.)

How do you build a hash function? For now, let's go with "decent people shouldn't think too much about that." Mathematicians have worked out better and worse ways to design hash functions, but for our purposes we really don't need to worry about the internals. It's good enough to just think of a hash function as a function that is

deterministic (equal inputs give equal outputs), but
looks random (it's hard to predict one hash code given another).

Once we have a hash function, we can build a very simple hash table. We'll make an array of "buckets," which you can think of as being analogous to drawers in our dresser. To store an item in the hash table, we'll compute the hash code of the object and use it as an index in the table, which is analogous to "pick which drawer this item goes in." Then, we put that data item inside the bucket at that index. If that bucket was empty, great! We can put the item there. If that bucket is full, we have some choices of what we can do. A simple approach (called chained hashing) is to treat each bucket as a list of items, the same way that your sock drawer might store multiple socks, and then just add the item to the list at that index.

To look something up in a hash table, we use basically the same procedure. We start by computing the hash code of the item to look up, which tells us which bucket (drawer) to look in. If the item is in the table, it has to be in that bucket. Then, we just look at all the items in that bucket and see whether our item is among them.
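Here is a minimal sketch of that bucket-array idea in Python, using chained hashing (the class and method names are made up for illustration and not taken from any particular library):

class ChainedHashTable:
    # Each bucket is a plain list of (key, value) pairs.
    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, key):
        # hash code -> bucket index ("which drawer does this go in")
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket_for(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # otherwise add it to this bucket's list

    def get(self, key):
        # only the one bucket the key hashes to needs to be searched
        for k, v in self._bucket_for(key):
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("socks", 12)
print(table.get("socks"))   # 12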

What's the advantage of doing things this way? Well, assuming we have a large number of buckets, we'd expect that most buckets won't have too many things in them. After all, our hash function kinda sorta ish looks like it has random outputs, so the items are distributed kinda sorta ish evenly across all the buckets. In fact, if we formalize the notion of "our hash function looks kinda random," we can prove that the expected number of items in each bucket is the ratio of the total number of items to the total number of buckets. Therefore, we can find the items we're looking for without having to do too much work.

The Details

It's a bit tricky to explain how "a hash table" works because there are many flavors of hash tables. This next section covers a few general implementation details common to all hash tables, plus some specifics of how different styles of hash tables work.

A first question that comes up is how you turn a hash code into a table slot index. In the above discussion, I just said "use the hash code as an index," but that's actually not a very good idea. In most programming languages, hash codes work out to 32-bit or 64-bit integers, and you aren't going to be able to use those directly as bucket indices. Instead, a common strategy is to make an array of buckets of some size m, compute the (full 32- or 64-bit) hash codes for your items, then mod them by the size of the table to get an index between 0 and m-1, inclusive. The use of modulus works well here because it's decently fast and does a decent job spreading the full range of hash codes across a smaller range.

(You will sometimes see bitwise operators used here. If the size of your table is a power of two, say 2^k, then computing the bitwise AND of the hash code and the number 2^k - 1 is equivalent to computing a modulus, and it's significantly faster.)
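For example, with a hypothetical table size of 2^4 = 16:

m = 16                       # table size, a power of two (2**4)
hash_code = 123456789
print(hash_code % m)         # 5, via modulus
print(hash_code & (m - 1))   # 5, via bitwise AND -- same index, typically faster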

The next question is how to choose the right number of buckets. If you pick too many buckets, then most buckets will be empty or hold only a few elements (good for speed - you only have to check a few items per bucket), but you'll use a lot of space simply storing the buckets (not so great, though maybe you can afford it). The flip side holds as well - if you have too few buckets, then each bucket will hold more elements on average, making lookups take longer, but you'll use less memory.

A good compromise is to dynamically change the number of buckets over the lifetime of the hash table. The load factor of a hash table, typically denoted α, is the ratio of the number of elements to the number of buckets. Most hash tables pick some maximum load factor. Once the load factor crosses this limit, the hash table increases its number of slots (say, by doubling), then redistributes the elements from the old table into the new one. This is called rehashing. Assuming the maximum load factor in the table is a constant, this ensures that, assuming you have a good hash function, the expected cost of doing a lookup remains O(1). Insertions now have an amortized expected cost of O(1) because of the cost of periodically rebuilding the table, as is the case with deletions. (Deletions can similarly compact the table if the load factor gets too small.)
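A sketch of that rehashing step, bolted onto the chained table sketched earlier (the 0.75 maximum load factor is just an assumed, commonly chosen value, not one prescribed above):

MAX_LOAD_FACTOR = 0.75   # assumed threshold

def put_with_rehash(table, key, value):
    table.put(key, value)
    num_items = sum(len(bucket) for bucket in table.buckets)
    if num_items / len(table.buckets) > MAX_LOAD_FACTOR:
        old_items = [pair for bucket in table.buckets for pair in bucket]
        table.buckets = [[] for _ in range(2 * len(table.buckets))]   # double the buckets
        for k, v in old_items:    # redistribute every element into the new, bigger table
            table.put(k, v)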

Hashing Strategies

Up to this point, we've been talking about chained hashing, which is one of many different strategies for building a hash table. As a reminder, chained hashing is kind of like a clothes dresser - each bucket (drawer) can hold multiple items, and when you do a lookup you check all of those items.

However, this isn't the only way to build a hash table. There's another family of hash tables that use a strategy called open addressing. The basic idea of open addressing is to store an array of slots, where each slot is either empty or holds exactly one item.

In open addressing, when you perform an insertion, as before, you jump to some slot whose index depends on the hash code computed. If that slot is free, great! You put the item there, and you're done. But what if the slot is already full? In that case, you use some secondary strategy to find a different free slot in which to store the item. The most common strategy for doing this uses an approach called linear probing. In linear probing, if the slot you want is already full, you simply shift to the next slot in the table. If that slot is empty, great! You can put the item there. But if that slot is full, you then move to the next slot in the table, etc. (If you hit the end of the table, just wrap back around to the beginning).
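A compact sketch of open addressing with linear probing in Python (deletion is left out, and the sketch assumes the table never completely fills up):

class LinearProbingTable:
    # One (key, value) pair per slot; walk forward to the next slot on a collision.
    def __init__(self, num_slots=16):
        self.slots = [None] * num_slots

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)   # slot taken: try the next one, wrapping around
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)   # keep walking until a match or an empty slot
        raise KeyError(key)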

Linear probing is a surprisingly fast way to build a hash table. CPU caches are optimized for locality of reference, so memory lookups in adjacent memory locations tend to be much faster than memory lookups in scattered locations. Since a linear probing insertion or deletion works by hitting some array slot and then walking linearly forward, it results in few cache misses and ends up being a lot faster than what the theory normally predicts. (And it happens to be the case that the theory predicts it's going to be very fast!)

Another strategy that's become popular recently is cuckoo hashing. I like to think of cuckoo hashing as the "Frozen" of hash tables. Instead of having one hash table and one hash function, we have two hash tables and two hash functions. Each item can be in exactly one of two places - it's either in the location in the first table given by the first hash function, or it's in the location in the second table given by the second hash function. This means that lookups are worst-case efficient, since you only have to check two spots to see if something is in the table.

Insertions in cuckoo hashing use a different strategy than before. We start off by seeing if either of the two slots that could hold the item are free. If so, great! We just put the item there. But if that doesn't work, then we pick one of the slots, put the item there, and kick out the item that used to be there. That item has to go somewhere, so we try putting it in the other table at the appropriate slot. If that works, great! If not, we kick an item out of that table and try inserting it into the other table. This process continues until everything comes to rest, or we find ourselves trapped in a cycle. (That latter case is rare, and if it happens we have a bunch of options, like "put it in a secondary hash table" or "choose new hash functions and rebuild the tables.")
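A rough sketch of that insertion dance in Python (two tables, two hash functions; updating an existing key and the "rebuild on a cycle" fallback are only hinted at, and the helper names are invented for illustration):

class CuckooTable:
    # Every key lives in exactly one of its two possible spots.
    def __init__(self, size=16):
        self.tables = [[None] * size, [None] * size]

    def _index(self, which, key):
        # two different hash functions, one per table (a simple trick for illustration)
        return hash((which, key)) % len(self.tables[which])

    def get(self, key):
        for which in (0, 1):                       # worst case: check just two spots
            slot = self.tables[which][self._index(which, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def put(self, key, value, max_kicks=32):
        item, which = (key, value), 0
        for _ in range(max_kicks):
            i = self._index(which, item[0])
            if self.tables[which][i] is None:      # free spot: done
                self.tables[which][i] = item
                return
            # evict the current occupant and try to re-home it in the other table
            self.tables[which][i], item = item, self.tables[which][i]
            which = 1 - which
        raise RuntimeError("probable cycle: pick new hash functions and rebuild")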

There are many improvements possible for cuckoo hashing, such as using multiple tables, letting each slot hold multiple items, and keeping a "stash" that holds items that can't fit anywhere else. This is an active area of research!

Then there are hybrid approaches. Hopscotch hashing is a mix between open addressing and chained hashing that can be thought of as taking a chained hash table and storing each item in each bucket in a slot near where the item wants to go. This strategy plays well with multithreading. The Swiss table uses the fact that some processors can perform multiple operations in parallel with a single instruction to speed up a linear probing table. Extendible hashing is designed for databases and file systems and uses a mix of a trie and a chained hash table to dynamically increase bucket sizes as individual buckets get loaded. Robin Hood hashing is a variant of linear probing in which items can be moved after being inserted to reduce the variance in how far from home each element can live.

Further Reading

For more information about the basics of hash tables, check out the lecture slides on chained hashing and the follow-up slides on linear probing and Robin Hood hashing. You can learn more about cuckoo hashing there, as well as about the theoretical properties of hash functions.

Direct Address Table

To understand hash tables, the direct address table is the first concept we should understand.

A direct address table uses the key directly as the index of a slot in an array. The size of the universe of keys equals the size of the array. Accessing a key is really fast, in O(1) time, because an array supports random-access operations.
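A minimal direct address table in Python, assuming the keys are small non-negative integers (the universe size of 10 is arbitrary):

class DirectAddressTable:
    # The key itself is the array index; one slot per possible key.
    def __init__(self, universe_size=10):
        self.slots = [None] * universe_size

    def put(self, key, value):
        self.slots[key] = value    # O(1): the key *is* the index

    def get(self, key):
        return self.slots[key]     # O(1) random access into the array

t = DirectAddressTable()
t.put(3, "three")
print(t.get(3))   # three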

However, there are four considerations before implementing a direct address table:

To be a valid array index, the key needs to be an integer
The range of keys must be fairly small; otherwise, we would need a huge array
No two different keys may map to the same slot in the array
The size of the universe of keys equals the size of the array

In reality, not many real-life situations fit the requirements above, so the hash table comes to the rescue.

Hash Table

Instead of using the key directly, a hash table first applies a mathematical hash function to consistently convert arbitrary key data into a number, and then uses that hash result as the key.

The size of the universe of keys can be greater than the size of the array, which means two different keys can hash to the same index (this is called a hash collision).

In practice, there are a few different strategies to deal with it. Here is a common solution: instead of storing the actual values in the array, we store a pointer to a linked list that holds the values of all the keys that hash to that index.

If you are still interested in knowing how to implement a hashmap from scratch, read the post below.

Short and sweet:

A hash table wraps up an array, which we'll call internalArray. Items are inserted into the array in the following manner:

let insert key value =
    internalArray[hash(key) % internalArray.Length] <- (key, value)
    //oversimplified for educational purposes

Sometimes two keys will hash to the same index in the array, and you want to keep both values. I like to store both values at the same index, which is simple to code if you make internalArray an array of linked lists:

let insert key value =
    internalArray[hash(key) % internalArray.Length].AddLast((key, value))

So, if I wanted to retrieve an item out of my hash table, I could write:

let get key =
    let linkedList = internalArray[hash(key) % internalArray.Length]
    // walk the bucket's list looking for a matching key; None if it isn't there
    linkedList
    |> Seq.tryPick (fun (testKey, value) -> if testKey = key then Some value else None)

Removing is just as simple to write. As you can tell, inserting, looking up, and removing from our array of linked lists is nearly O(1).

When our internalArray gets too full, maybe at around 85% capacity, we can resize the internal array and move all the items from the old array into the new array.