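How does a hash table work?

I'm looking for an explanation of how a hash table works, in plain English, for a simpleton like me!

For example, I know it takes the key, calculates the hash (this is the part I'd like explained) and then performs some kind of modulo to work out where it lies in the array where the value is stored, but that's where my knowledge stops.

Could anyone clarify the process?

Edit: I'm not asking specifically about how hash codes are calculated, but for a general overview of how a hash table works.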

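Current answer

Here's another way to look at it.

I assume you understand the concept of an array A. It supports indexing: you can get to the Ith element, A[I], in one step, no matter how large A is.

So, for example, if you want to store information about a group of people who all happen to have different ages, a simple way would be to have an array that is large enough and use each person's age as an index into the array. That way, you get one-step access to any person's information.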

But of course there could be more than one person with the same age, so what you put in the array at each entry is a list of all the people who have that age. So you can get to an individual person's information in one step plus a little bit of search in that list (called a "bucket"). It only slows down if there are so many people that the buckets get big. Then you need a larger array, and some other way to get more identifying information about the person, like the first few letters of their surname, instead of using age.

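That's the basic idea. Instead of using age, you can use any function of the person that produces a good spread of values. That's the hash function. For example, you could take every third bit of the ASCII representation of the person's name, scrambled in some order. All that matters is that you don't want too many people to hash to the same bucket, because the speed depends on the buckets staying small.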
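The age example above can be sketched in Python. This is only an illustration: the upper bound `MAX_AGE` and the helper names `add_person` and `find_person` are assumptions made for the sketch, not part of the answer.

```python
MAX_AGE = 120  # assumed upper bound on ages, chosen for this sketch

# One bucket (a list) per possible age.
buckets = [[] for _ in range(MAX_AGE)]

def add_person(name, age):
    """Jump straight to the bucket for this age and append the record."""
    buckets[age].append((name, age))

def find_person(name, age):
    """One indexing step, then a short search within the bucket."""
    for record in buckets[age]:
        if record[0] == name:
            return record
    return None

add_person("Alice", 30)
add_person("Bob", 30)    # same age: lands in the same bucket as Alice
add_person("Carol", 41)
```

Swapping `age` for the result of some other function of the person (reduced to the array size) turns this into a hash table with buckets.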

2009-04-08 17:44:33

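Other answers

Here's an explanation in layman's terms.

Let's assume you want to fill up a library with books, and not just stuff them in there: you want to be able to find them again easily when you need them.

So, you decide that if the person who wants to read a book knows the title of the book, and the exact title at that, then that's all it should take. With the title, the person, with the aid of the librarian, should be able to find the book easily and quickly.

So, how can you do that? Well, obviously you can keep some kind of list of where you put each book, but then you have the same problem as searching the library: you need to search the list. Granted, the list would be smaller and easier to search, but you still don't want to search sequentially from one end of the library (or list) to the other.

You want something that, given the title of the book, gives you the right spot at once, so all you have to do is stroll over to the right shelf and pick up the book.

But how can that be done? Well, with a bit of forethought when you fill up the library, and a lot of work when you fill up the library.

Instead of just filling the library from one end to the other, you devise a clever little method. You take the title of the book and run it through a small computer program, which spits out a shelf number and a slot number on that shelf. This is where you place the book.

The beauty of this program is that later on, when a person comes back in to read the book, you feed the title through the program once more and get back the same shelf number and slot number that you were originally given, and this is where the book is located.

The program, as others have already mentioned, is called a hash algorithm or hash computation, and usually works by taking the data fed into it (the title of the book in this case) and calculating a number from it.

For simplicity, let's say that it just converts each letter and symbol into a number and sums them all up. In reality, it's a lot more complicated than that, but let's leave it at that for now.

The beauty of such an algorithm is that if you feed the same input into it again and again, it will keep spitting out the same number each time.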
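As a rough sketch of that simplified "add up the letters" program, here is a toy version in Python. The name `toy_hash` is made up for this example, and real hash functions mix their input far more thoroughly.

```python
def toy_hash(title):
    """Add up the character codes of the title (toy example only)."""
    return sum(ord(ch) for ch in title)

print(toy_hash("abc"))                     # 97 + 98 + 99 = 294
print(toy_hash("abc") == toy_hash("abc"))  # True: same input, same number
print(toy_hash("abc") == toy_hash("cab"))  # True: reordered letters give the same sum
```

The last line hints at why real algorithms are more complicated: a plain sum gives the same number to every rearrangement of the same letters.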

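OK, so that's basically how a hash table works.

Technical stuff follows.

First, there's the size of the number. Usually, the output of such a hash algorithm falls within a large range of numbers, typically much larger than the space you have in your table. For instance, let's say that we have room for exactly one million books in the library. The output of the hash calculation could be in the range of 0 to one billion, which is a lot higher.

So, what do we do? We use something called modulus calculation. It basically says that if you count up to the number you wanted (i.e. the one-billion number) but want to stay inside a much smaller range, then each time you hit the limit of that smaller range you start back at 0, while still keeping track of how far along the big sequence you've come.

Say that the output of the hash algorithm is in the range of 0 to 20, and you get the value 17 from a particular title. If the size of the library is only 7 books, you count 1, 2, 3, 4, 5, 6, and when you get to 7, you start back at 0. Since we need to count 17 times, we have 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, and the final number is 3.

Of course, modulus calculation isn't done like that; it's done with division and a remainder. The remainder of dividing 17 by 7 is 3 (7 goes into 17 twice, making 14, and the difference between 17 and 14 is 3).

Thus, you put the book in slot number 3.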
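The two ways of arriving at slot 3 can be compared directly. This is a small Python sketch; the function names are invented for illustration.

```python
def slot_by_counting(hash_value, table_size):
    """Count up one step at a time, wrapping back to 0 at table_size."""
    slot = 0
    for _ in range(hash_value):
        slot += 1
        if slot == table_size:
            slot = 0
    return slot

def slot_by_remainder(hash_value, table_size):
    """The same result in a single step, using the remainder operator."""
    return hash_value % table_size

print(slot_by_counting(17, 7))   # 3
print(slot_by_remainder(17, 7))  # 3
```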

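This leads to the next problem: collisions. Since the algorithm has no way to space out the books so that they fill the library exactly (or the hash table, if you will), it will invariably end up calculating a number that has been used before. In library terms, when you get to the shelf and slot number where you want to put a book, there's already a book there.

Various collision handling methods exist, including running the data through yet another calculation to get another spot in the table (double hashing), or simply finding a free space close to the one you were given (e.g. right next to the previous book, assuming that slot is available, also known as linear probing). This means you have some digging to do when you try to find the book later, but it's still better than simply starting at one end of the library.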
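A minimal sketch of linear probing, assuming a tiny 7-slot library and the toy "sum of letters" hash from earlier in this answer. All names here are hypothetical, and removal of books is deliberately left out because it needs extra care with this scheme.

```python
TABLE_SIZE = 7
slots = [None] * TABLE_SIZE  # each slot holds one title, or None if empty

def toy_hash(title):
    return sum(ord(ch) for ch in title)

def insert(title):
    """Place the title in its home slot, or in the next free slot after it."""
    index = toy_hash(title) % TABLE_SIZE
    for step in range(TABLE_SIZE):
        probe = (index + step) % TABLE_SIZE
        if slots[probe] is None or slots[probe] == title:
            slots[probe] = title
            return probe
    raise RuntimeError("table is full")

def find(title):
    """Retrace the same probe path; an empty slot means the title is absent."""
    index = toy_hash(title) % TABLE_SIZE
    for step in range(TABLE_SIZE):
        probe = (index + step) % TABLE_SIZE
        if slots[probe] is None:
            return None
        if slots[probe] == title:
            return probe
    return None

first = insert("ab")   # lands in its home slot
second = insert("ba")  # same hash as "ab", so it moves one slot along
```

Looking up "ba" later starts at the same home slot, finds "ab" there, and keeps walking until it reaches "ba": that is the "digging" mentioned above.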

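Finally, at some point you might want to put more books into the library than it has room for. In other words, you need to build a bigger library. Since the exact spot in the library was calculated using the exact, current size of the library, it follows that if you resize the library you might end up having to find new spots for all the books, because the calculation used to find their spots has changed.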
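To see why resizing forces books to move, here is a small sketch that again uses the toy "sum of letters" hash. The `resize` helper and the choice of growing from 7 to 14 slots are assumptions for the example.

```python
def toy_hash(title):
    return sum(ord(ch) for ch in title)

def resize(old_slots, new_size):
    """Build a larger table and re-place every book using the new size.

    Collisions are resolved by stepping to the next free slot.
    """
    new_slots = [None] * new_size
    for title in old_slots:
        if title is None:
            continue
        index = toy_hash(title) % new_size
        while new_slots[index] is not None:
            index = (index + 1) % new_size
        new_slots[index] = title
    return new_slots

small = [None] * 7
small[toy_hash("Kim") % 7] = "Kim"   # 289 % 7 == 2
large = resize(small, 14)            # 289 % 14 == 9

print(small.index("Kim"))  # 2
print(large.index("Kim"))  # 9
```

The title did not change, but its slot did, because the table size is part of the calculation.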

I hope this explanation was a bit more down to earth than buckets and functions :)

2009-04-08 16:33:02

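A hash table rests entirely on the fact that practical computation follows the random access machine model, i.e. the value at any address in memory can be accessed in O(1), or constant, time.

So, if I have a universe of keys (the set of all possible keys that I can use in an application, e.g. roll numbers for students: if they are 4 digits long, this universe is the set of numbers from 1 to 9999) and a way to map them to a finite set of numbers small enough that I can allocate memory for it on my system, then in theory my hash table is ready.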

Generally, in applications the universe of keys is much larger than the number of elements I want to add to the hash table (I don't want to waste 1 GB of memory to hash, say, 10,000 or 100,000 integer values just because they are 32 bits long in binary representation). So we use hashing. It's a sort of mixing "mathematical" operation that maps my large universe to a small set of values that I can accommodate in memory. In practical cases, the space used by a hash table is often of the same order (big-O) as (number of elements × size of each element), so we don't waste much memory.

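Now, with a large set mapped to a small set, the mapping must be many-to-one. So different keys will be allotted the same space (?? not fair). There are a few ways to handle this; I only know the two most popular ones: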

First, use the space that was to be allocated to the value as a reference to a linked list. This linked list stores one or more values that come to reside in the same slot under the many-to-one mapping. The linked list also contains the keys, to help whoever comes searching. It's like many people sharing one apartment: when a delivery man comes, he goes to the apartment and asks specifically for the person he wants.

Second, use a double hash function on an array, which gives the same sequence of values every time rather than a single value. When I go to store a value, I check whether the required memory location is free or occupied. If it's free, I store my value there; if it's occupied, I take the next value from the sequence, and so on until I find a free location, and I store my value there. When searching for or retrieving the value, I follow the same path given by the sequence and at each location ask whether the value is there, until I find it or have searched all possible locations in the array.
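The second approach can be sketched as follows. This is one common form of double hashing, not necessarily the exact scheme the answer has in mind: the second hash `1 + key % (table_size - 1)` and the prime table size are assumptions chosen so that the probe sequence visits every slot once, and the roll numbers and names are invented.

```python
def probe_sequence(key, table_size):
    """Return the slots to try, in order, for an integer key.

    table_size is assumed to be prime so that every slot is visited once.
    """
    start = key % table_size
    step = 1 + (key % (table_size - 1))  # second hash; never zero
    return [(start + i * step) % table_size for i in range(table_size)]

def insert(table, key, value):
    for slot in probe_sequence(key, len(table)):
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return slot
    raise RuntimeError("table is full")

def lookup(table, key):
    for slot in probe_sequence(key, len(table)):
        if table[slot] is None:
            return None              # reached a free slot: key was never stored
        if table[slot][0] == key:
            return table[slot][1]
    return None

table = [None] * 7
insert(table, 1234, "Asha")  # roll numbers as keys
insert(table, 1241, "Ben")   # same first slot as 1234, so it takes its next probe
```

Because the sequence is fully determined by the key, a search retraces exactly the path that the insert took.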

Introduction to Algorithms by CLRS provides very good insight into this topic.

2015-06-12 05:19:45

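Usage and lingo:

Hash tables are used to quickly store and retrieve data (or records). Records are stored in buckets using hash keys. Hash keys are calculated by applying a hashing algorithm to a chosen value (the key value) contained within the record. This chosen value must be common to all the records. Each bucket can hold multiple records, which are organized in a particular order.

Real-world example:

Hash & Co., founded in 1803 and lacking any computer technology, had a total of 300 filing cabinets to keep the detailed information (the records) for its approximately 30,000 clients. Each file folder was clearly identified by its client number, a unique number from 0 to 29,999.

The filing clerks of that time had to quickly fetch and store client records for the working staff. The staff had decided that it would be more efficient to use a hashing methodology to store and retrieve their records.

To file a client record, filing clerks would use the unique client number written on the folder. Using this client number as the hash key, they would take it modulo 300 to identify the filing cabinet the folder belongs in. When they opened the filing cabinet, they would find that it contained many folders ordered by client number. After identifying the correct location, they would simply slip the folder in.

To retrieve a client record, filing clerks would be given a client number on a slip of paper. Using this unique client number (the hash key), they would take it modulo 300 to determine which filing cabinet held the client's folder. When they opened the filing cabinet, they would find that it contained many folders ordered by client number. Searching through the records, they would quickly find the client's folder and retrieve it.

In our real-world example, the buckets are filing cabinets and the records are file folders.

An important thing to remember is that computers (and their algorithms) deal with numbers better than with strings. So accessing a large array using an index is significantly faster than searching it sequentially.

As Simon has mentioned, and I believe this is very important, the hashing part is about transforming a large space (of arbitrary length, usually strings, etc.) and mapping it to a small space (of known size, usually numbers) for indexing. This is very important to remember!

So in the example above, the 30,000 or so possible clients are mapped into a smaller space.

The main idea is to divide your entire data set into segments so as to speed up the actual searching, which is usually time consuming. In our example above, each of the 300 filing cabinets would (statistically) contain about 100 records. Searching through 100 records (regardless of their order) is much faster than having to deal with 30,000.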
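The filing-cabinet arithmetic can be checked with a few lines of Python. This sketch assumes every client number from 0 to 29,999 is in use, which is why each cabinet ends up with exactly 100 folders; `find_cabinet` is a name invented for the example.

```python
NUM_CABINETS = 300
NUM_CLIENTS = 30_000

# cabinets[i] holds the client numbers filed in cabinet i, in ascending order
cabinets = [[] for _ in range(NUM_CABINETS)]

for client_number in range(NUM_CLIENTS):
    cabinets[client_number % NUM_CABINETS].append(client_number)

def find_cabinet(client_number):
    """The client number is the hash key; modulo 300 picks the cabinet."""
    return client_number % NUM_CABINETS

print(find_cabinet(12_345))                      # 45
print(len(cabinets[45]))                         # 100
print(12_345 in cabinets[find_cabinet(12_345)])  # True
```

The clerk only ever searches one cabinet of about 100 folders instead of all 30,000.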

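You may have noticed that some people already do this. But instead of devising a hashing methodology to generate a hash key, in most cases they simply use the first letter of the last name. So if you have 26 filing cabinets, one for each letter from A to Z, you have in theory just segmented your data and improved the filing and retrieval process.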

2009-04-08 17:20:00

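You guys are very close to explaining this fully, but a couple of things are missing. The hash table is just an array. The array itself will contain something in each slot. At a minimum you will store the hash value or the value itself in this slot. In addition, you could store a linked/chained list of the values that have collided on this slot, or you could use the open addressing method. You can also store a pointer or pointers to other data you want to retrieve from this slot.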

It's important to note that the hash value itself generally does not indicate the slot into which to place the value. For example, a hash value might be a negative integer, and a negative number obviously cannot point to an array location. Additionally, hash values will often be much larger than the number of slots available. So the hash table itself needs to perform another calculation to figure out which slot the value should go into. This is done with a modulus operation like:

uint slotIndex = hashValue % hashTableSize;

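This value is the slot the value will go into. In open addressing, if the slot is already filled with another hash value and/or other data, the modulus operation is run once again to find the next slot:

slotIndex = (slotIndex + 1) % hashTableSize;

I suppose there may be other, more advanced methods for determining the slot index, but this is the most common one I've seen... I would be interested in any others that perform better.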

With the modulus method, if you have a table of, say, size 1000, any hash value between 0 and 999 goes into the corresponding slot. Negative values (once made non-negative) and values of 1000 or more map onto those same slots and can therefore collide with them. The chances of that happening depend both on your hashing method and on how many items in total you add to the hash table. Generally, it's best practice to size the hash table so that the total number of values added to it is only about 70% of its size. If your hash function distributes values evenly, you will generally encounter few or no bucket/slot collisions, and both lookups and writes will be very quick. If the total number of values to add is not known in advance, make a good guesstimate by whatever means, and then resize your hash table once the number of elements added to it reaches 70% of capacity.
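A minimal Python sketch of the 70% rule of thumb follows. The class name `GrowingTable`, the starting size of 8, the doubling strategy and the use of chained buckets are all assumptions made for the illustration. Note that Python's `%` already returns a non-negative result for a positive table size, so negative hash values need no special handling here.

```python
class GrowingTable:
    """Chained table that doubles once it reaches 70% full (a sketch)."""

    MAX_LOAD = 0.7  # threshold taken from the rule of thumb above

    def __init__(self, size=8):
        self.slots = [[] for _ in range(size)]
        self.count = 0

    def load_factor(self):
        return self.count / len(self.slots)

    def add(self, key):
        bucket = self.slots[hash(key) % len(self.slots)]
        if key in bucket:
            return
        bucket.append(key)
        self.count += 1
        if self.load_factor() >= self.MAX_LOAD:
            self._grow()

    def _grow(self):
        old_keys = [key for bucket in self.slots for key in bucket]
        self.slots = [[] for _ in range(len(self.slots) * 2)]
        for key in old_keys:  # every key is re-placed using the new size
            self.slots[hash(key) % len(self.slots)].append(key)

table = GrowingTable(size=8)
for number in range(6):
    table.add(number)

print(len(table.slots))     # 16: the sixth key pushed the load to 6/8 = 0.75
print(table.load_factor())  # 0.375
```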

I hope this has helped.

PS - In C#, the GetHashCode() method is pretty slow and results in actual value collisions under a lot of the conditions I've tested. For some real fun, build your own hash function and try to get it to NEVER collide on the specific data you are hashing, run faster than GetHashCode, and have a fairly even distribution. I've done this using long instead of int-sized hash code values, and it has worked quite well with up to 32 million entries in the hash table and 0 collisions. Unfortunately I can't share the code as it belongs to my employer... but I can reveal that it is possible for certain data domains. When you can achieve this, the hash table is VERY fast. :)

2010-05-15 01:41:55

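It's even simpler than that.

A hash table is nothing more than an array (usually a sparse one) of vectors that contain key/value pairs. The maximum size of this array is typically smaller than the number of items in the set of possible values for the type of data being stored in the hash table.

The hash algorithm is used to generate an index into that array based on the value of the item that will be stored in the array.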

This is where storing vectors of key/value pairs in the array comes in. Because the set of values that can be indexes into the array is typically smaller than the number of all possible values that the type can have, it is possible that your hash algorithm will generate the same value for two separate keys. A good hash algorithm will prevent this as much as possible (which is why it is usually left to the type itself, since the type has specific information that a general hash algorithm can't possibly know), but it's impossible to prevent entirely.

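Because of this, you can have multiple keys that generate the same hash code. When that happens, the items in the vector are iterated through, and a direct comparison is made between each key in the vector and the key being looked up. If it is found, great, and the value associated with the key is returned; otherwise, nothing is returned.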
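The structure described here, an array of vectors of key/value pairs, might look roughly like this in Python. The class `SimpleHashTable` and its method names are hypothetical, and Python's built-in `hash()` stands in for the type-specific hash algorithm.

```python
class SimpleHashTable:
    def __init__(self, size=16):
        # the array: one vector (list) of (key, value) pairs per slot
        self.slots = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.slots[hash(key) % len(self.slots)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # key already present: replace its value
                return
        bucket.append((key, value))

    def get(self, key):
        bucket = self.slots[hash(key) % len(self.slots)]
        for existing_key, value in bucket:
            if existing_key == key:       # direct key comparison
                return value
        return None                       # not found

table = SimpleHashTable(size=4)
table.put(1, "one")
table.put(5, "five")  # 5 % 4 == 1 % 4, so both pairs share one vector
```

Here `table.get(5)` walks the shared vector, skips the pair for key 1, and returns "five"; a key that was never stored yields `None`.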

2009-04-08 16:04:43
