哈希表是如何工作的?

我正在寻找一个关于哈希表如何工作的解释-用简单的英语为我这样的傻瓜!

例如，我知道它接受键，计算哈希(我正在寻找一个解释)，然后执行某种模运算来计算出它在存储值的数组中的位置，但我的知识到此为止。

谁能解释一下过程吗?

编辑:我并不是特别问哈希码是如何计算的，而是对哈希表如何工作的一般概述。

当前回答

哈希表完全基于这样一个事实，即实际计算遵循随机访问机模型，即内存中任何地址的值都可以在O(1)时间或常数时间内访问。

因此，如果我有一个键的宇宙(我可以在应用程序中使用的所有可能的键的集合，例如，滚动no。对于学生来说，如果它是4位，那么这个宇宙就是从1到9999的一组数字)，并且一种将它们映射到有限大小的数字集的方法可以在我的系统中分配内存，理论上我的哈希表已经准备好了。

Generally, in applications the size of universe of keys is very large than number of elements I want to add to the hash table(I don't wanna waste a 1 GB memory to hash ,say, 10000 or 100000 integer values because they are 32 bit long in binary reprsentaion). So, we use this hashing. It's sort of a mixing kind of "mathematical" operation, which maps my large universe to a small set of values that I can accomodate in memory. In practical cases, often space of a hash table is of the same "order"(big-O) as the (number of elements *size of each element), So, we don't waste much memory.

现在，一个大集合映射到一个小集合，映射必须是多对一的。因此，不同的键将被分配相同的空间(?? ?不公平)。有几种方法可以解决这个问题，我只知道其中最流行的两种:

Use the space that was to be allocated to the value as a reference to a linked list. This linked list will store one or more values, that come to reside in same slot in many to one mapping. The linked list also contains keys to help someone who comes searching. It's like many people in same apartment, when a delivery-man comes, he goes to the room and asks specifically for the guy. Use a double hash function in an array which gives the same sequence of values every time rather than a single value. When I go to store a value, I see whether the required memory location is free or occupied. If it's free, I can store my value there, if it's occupied I take next value from the sequence and so on until I find a free location and I store my value there. When searching or retreiving the value, I go back on same path as given by the sequence and at each location ask for the vaue if it's there until I find it or search all possible locations in the array.

CLRS的《算法导论》对这个主题提供了非常好的见解。

2015-06-12 05:19:45

其他回答

到目前为止，所有的答案都很好，并且从不同的方面了解了哈希表的工作方式。这里有一个简单的例子，可能会有帮助。假设我们想要存储一些带有小写字母字符串的项作为键。

正如simon所解释的，哈希函数用于从大空间映射到小空间。对于我们的例子，一个简单的哈希函数实现可以取字符串的第一个字母，并将其映射为一个整数，因此“短吻鳄”的哈希代码为0，“蜜蜂”的哈希代码为1，“斑马”的哈希代码为25，等等。

接下来，我们有一个包含26个存储桶的数组(在Java中可以是数组列表)，我们将项放入与键的哈希码匹配的存储桶中。如果我们有不止一个元素键以相同字母开头，它们就会有相同的哈希码，所以它们都会进入存储桶中寻找那个哈希码所以必须在存储桶中进行线性搜索才能找到一个特定的元素。

在我们的例子中，如果我们只有几十个项目，键横跨字母表，它会工作得很好。然而，如果我们有一百万个条目，或者所有的键都以'a'或'b'开头，那么我们的哈希表就不是理想的。为了获得更好的性能，我们需要一个不同的哈希函数和/或更多的桶。

2009-04-08 16:41:10

直连地址表

要理解哈希表，直接地址表是我们应该理解的第一个概念。

直接地址表直接使用键作为数组中槽的索引。宇宙键的大小等于数组的大小。在O(1)时间内访问这个键非常快，因为数组支持随机访问操作。

然而，在实现直接地址表之前，有四个注意事项:

要成为有效的数组索引，键应该是整数键的范围是相当小的，否则，我们将需要一个巨大的数组。不能将两个不同的键映射到数组中的同一个槽宇宙键的长度等于数组的长度

事实上，现实生活中并不是很多情况都符合上述要求，所以哈希表就可以救场了

哈希表

哈希表不是直接使用键，而是首先应用数学哈希函数将任意键数据一致地转换为数字，然后使用该哈希结果作为键。

宇宙键的长度可以大于数组的长度，这意味着两个不同的键可以散列到相同的索引(称为散列碰撞)?

实际上，有一些不同的策略来处理它。这里有一个常见的解决方案:我们不将实际值存储在数组中，而是存储一个指向链表的指针，该链表包含散列到该索引的所有键的值。

如果你仍然有兴趣知道如何从头开始实现hashmap，请阅读下面的帖子

2021-04-10 07:41:30

有很多答案，但没有一个是非常可视化的，而哈希表在可视化时很容易“点击”。

哈希表通常实现为链表数组。如果我们想象一个存储人名的表，经过几次插入之后，它可能会被放置在内存中，其中()包含的数字是文本/姓名的哈希值。

bucket#  bucket content / linked list

[0]      --> "sue"(780) --> null
[1]      null
[2]      --> "fred"(42) --> "bill"(9282) --> "jane"(42) --> null
[3]      --> "mary"(73) --> null
[4]      null
[5]      --> "masayuki"(75) --> "sarwar"(105) --> null
[6]      --> "margaret"(2626) --> null
[7]      null
[8]      --> "bob"(308) --> null
[9]      null

以下几点:

each of the array entries (indices [0], [1]...) is known as a bucket, and starts a - possibly empty - linked list of values (aka elements, in this example - people's names) each value (e.g. "fred" with hash 42) is linked from bucket [hash % number_of_buckets] e.g. 42 % 10 == [2]; % is the modulo operator - the remainder when divided by the number of buckets multiple data values may collide at and be linked from the same bucket, most often because their hash values collide after the modulo operation (e.g. 42 % 10 == [2], and 9282 % 10 == [2]), but occasionally because the hash values are the same (e.g. "fred" and "jane" both shown with hash 42 above) most hash tables handle collisions - with slightly reduced performance but no functional confusion - by comparing the full value (here text) of a value being sought or inserted to each value already in the linked list at the hashed-to bucket

链表长度与负载因子有关，而不是值的数量

如果表的大小增加，上面实现的哈希表倾向于调整自己的大小(即创建一个更大的桶数组，在那里创建新的/更新的链表，删除旧的数组)，以保持值与桶的比率(又名负载因子)在0.5到1.0的范围内。

Hans gives the actual formula for other load factors in a comment below, but for indicative values: with load factor 1 and a cryptographic strength hash function, 1/e　(~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) have one element, 1/(2e) or ~18.4% two elements, 1/(3!e) about 6.1% three elements, 1/(4!e) or ~1.5% four elements, 1/(5!e) ~.3% have five etc.. - the average chain length from non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant time operations.

哈希表如何将键与值关联

Given a hash table implementation as described above, we can imagine creating a value type such as `struct Value { string name; int age; };`, and equality comparison and hash functions that only look at the `name` field (ignoring age), and then something wonderful happens: we can store `Value` records like `{"sue", 63}` in the table, then later search for "sue" without knowing her age, find the stored value and recover or even update her age - happy birthday Sue - which interestingly doesn't change the hash value so doesn't require that we move Sue's record to another bucket.

当我们这样做的时候，我们使用哈希表作为一个关联容器，也就是map，它存储的值可以被认为是由一个键(名称)和一个或多个其他字段组成，仍然被称为值(在我的例子中，只是年龄)。用作映射的哈希表实现称为哈希映射。

这与前面我们存储离散值的例子形成了对比，比如“sue”，你可以把它看作是它自己的键:这种用法被称为散列集。

还有其他方法来实现哈希表

并不是所有的哈希表都使用链表(称为独立链表)，但大多数通用哈希表都使用链表，因为主要的替代封闭哈希(又名开放寻址)-特别是支持擦除操作-与易于冲突的键/哈希函数相比性能不太稳定。

简单讲一下哈希函数

强大的散列…

一个通用的、最小化最坏情况碰撞的哈希函数的工作是有效地随机地在哈希表桶周围散布键，同时总是为相同的键生成相同的哈希值。理想情况下，即使在键的任何位置改变一个位，也会随机地翻转结果哈希值中的大约一半位。

This is normally orchestrated with maths too complicated for me to grok. I'll mention one easy-to-understand way - not the most scalable or cache friendly but inherently elegant (like encryption with a one-time pad!) - as I think it helps drive home the desirable qualities mentioned above. Say you were hashing 64-bit doubles - you could create 8 tables each of 256 random numbers (code below), then use each 8-bit/1-byte slice of the double's memory representation to index into a different table, XORing the random numbers you look up. With this approach, it's easy to see that a bit (in the binary digit sense) changing anywhere in the double results in a different random number being looked up in one of the tables, and a totally uncorrelated final value.

// note caveats above: cache unfriendly (SLOW) but strong hashing...
std::size_t random[8][256] = { ...random data... };
auto p = (const std::byte*)&my_double;
size_t hash = random[0][p[0]] ^
              random[1][p[1]] ^
              ... ^
              random[7][p[7]];

弱但通常快速的哈希…

Many libraries' hashing functions pass integers through unchanged (known as a trivial or identity hash function); it's the other extreme from the strong hashing described above. An identity hash is extremely collision prone in the worst cases, but the hope is that in the fairly common case of integer keys that tend to be incrementing (perhaps with some gaps), they'll map into successive buckets leaving fewer empty than random hashing leaves (our ~36.8% at load factor 1 mentioned earlier), thereby having fewer collisions and fewer longer linked lists of colliding elements than is achieved by random mappings. It's also great to save the time it takes to generate a strong hash, and if keys are looked up in order they'll be found in buckets nearby in memory, improving cache hits. When the keys don't increment nicely, the hope is they'll be random enough they won't need a strong hash function to totally randomise their placement into buckets.

2015-06-01 06:59:40

Hashtable inside contains cans in which it stores the key sets. The Hashtable uses the hashcode to decide to which the key pair should plan. The capacity to get the container area from Key's hashcode is known as hash work. In principle, a hash work is a capacity which when given a key, creates an address in the table. A hash work consistently returns a number for an item. Two equivalent items will consistently have a similar number while two inconsistent objects may not generally have various numbers. When we put objects into a hashtable then it is conceivable that various objects may have equal/ same hashcode. This is known as a collision. To determine collision, hashtable utilizes a variety of lists. The sets mapped to a single array index are stored in a list and then the list reference is stored in the index.

2021-11-30 15:01:54

哈希的计算方式通常不取决于哈希表，而是取决于添加到哈希表中的项。在框架/基类库(如。net和Java)中，每个对象都有一个GetHashCode()(或类似)方法，返回该对象的哈希码。理想的哈希码算法和准确的实现取决于对象中表示的数据。

2009-04-08 15:52:27

哈希表是如何工作的?

推荐文章

最新文章

标签