A Hashtable internally contains buckets in which it stores the key/value pairs. The Hashtable uses the key's hash code to decide which bucket a key/value pair should map to. The function that derives the bucket location from the key's hash code is known as a hash function. In principle, a hash function is a function which, when given a key, generates an address in the table. A hash function always returns a number for an object. Two equal objects will always have the same number, while two unequal objects may not always have different numbers. When we put objects into a hashtable, it is possible that different objects have the same hash code. This is known as a collision. To resolve collisions, the hashtable uses an array of lists. The pairs mapped to a single array index are stored in a list, and a reference to that list is stored at the index.
Why do people use dressers to store their clothing? Besides looking trendy and stylish, they have the advantage that every article of clothing has a place where it's supposed to be. If you're looking for a pair of socks, you just check the sock drawer. If you're looking for a shirt, you check the drawer that has your shirts in it. It doesn't matter, when you're looking for socks, how many shirts you have or how many pairs of pants you own, since you don't need to look at them. You just look in the sock drawer and expect to find socks there.
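You have some locations (drawers) where you can store items. You come up with some rule that tells you which location (drawer) each item belongs in. When you need to find something, you use that rule to determine which drawer to look in.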
         |\|   hash   |/| --> hash code
data --> |/| function |\|
All hash functions are deterministic: if you put the same data into the function multiple times, you'll always get the same value coming out the other side. And a good hash function should look more or less random: small changes to the input data should give wildly different hash codes. For example, the hash codes for the string "pudu" and for the string "kudu" will likely be wildly different from one another. (Then again, it's possible that they're the same. After all, if a hash function's outputs should look more or less random, there's a chance we get the same hash code twice.)
Deterministic (the same input gives the same output), but random-looking (it's hard to predict one hash code given another).
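To make the two properties concrete, here is a small sketch using FNV-1a, a well-known simple string hash. The text above doesn't name any particular hash function, so FNV-1a is just a stand-in, and the function name `fnv1a` is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime.
// Chosen only as an example of a deterministic, well-mixing hash.
std::uint32_t fnv1a(const std::string& data) {
    std::uint32_t h = 2166136261u;      // 32-bit FNV offset basis
    for (unsigned char byte : data) {
        h ^= byte;
        h *= 16777619u;                 // 32-bit FNV prime (unsigned overflow wraps)
    }
    return h;
}
```

Calling `fnv1a("pudu")` twice always gives the same number, while `fnv1a("kudu")` gives a different one even though only one letter changed.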
Once we have a hash function, we can build a very simple hash table. We'll make an array of "buckets," which you can think of as being analogous to drawers in our dresser. To store an item in the hash table, we'll compute the hash code of the object and use it as an index in the table, which is analogous to "pick which drawer this item goes in." Then, we put that data item inside the bucket at that index. If that bucket was empty, great! We can put the item there. If that bucket is full, we have some choices of what we can do. A simple approach (called chained hashing) is to treat each bucket as a list of items, the same way that your sock drawer might store multiple socks, and then just add the item to the list at that index.
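The chained hashing idea above can be sketched roughly as follows. This is a minimal illustration, not a production design: the class name `ChainedSet` is hypothetical, the bucket count is fixed, and `std::hash` stands in for "some hash function."

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

// Hypothetical minimal chained hash set of strings with a fixed number of buckets.
class ChainedSet {
public:
    explicit ChainedSet(std::size_t bucket_count) : buckets_(bucket_count) {
        assert(bucket_count > 0);
    }

    // Returns true if the item was added, false if it was already present.
    bool insert(const std::string& item) {
        std::list<std::string>& bucket = buckets_[index_of(item)];
        for (const std::string& existing : bucket)
            if (existing == item) return false;
        bucket.push_back(item);  // bucket already in use? just extend its list
        return true;
    }

    bool contains(const std::string& item) const {
        for (const std::string& existing : buckets_[index_of(item)])
            if (existing == item) return true;
        return false;
    }

private:
    // "Pick which drawer this item goes in."
    std::size_t index_of(const std::string& item) const {
        return std::hash<std::string>{}(item) % buckets_.size();
    }

    std::vector<std::list<std::string>> buckets_;
};
```

Only the one bucket chosen by the hash code is ever scanned, which is the whole point of the dresser analogy.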
What's the advantage of doing things this way? Well, assuming we have a large number of buckets, we'd expect that most buckets won't have too many things in them. After all, our hash function kinda sorta ish looks like it has random outputs, so the items are distributed kinda sorta ish evenly across all the buckets. In fact, if we formalize the notion of "our hash function looks kinda random," we can prove that the expected number of items in each bucket is the ratio of the total number of items to the total number of buckets. Therefore, we can find the items we're looking for without having to do too much work.
A first question that comes up is how you turn a hash code into a table slot index. In the above discussion, I just said "use the hash code as an index," but that's actually not a very good idea. In most programming languages, hash codes work out to 32-bit or 64-bit integers, and you aren't going to be able to use those directly as bucket indices. Instead, a common strategy is to make an array of buckets of some size m, compute the (full 32- or 64-bit) hash codes for your items, then mod them by the size of the table to get an index between 0 and m-1, inclusive. The use of modulus works well here because it's decently fast and does a decent job spreading the full range of hash codes across a smaller range.
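(Bitwise operators are sometimes used here. If your table has a size that's a power of two, say 2^k, then computing the bitwise AND of the hash code and the number 2^k - 1 is equivalent to computing a modulus, and it's significantly faster.)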
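The two ways of turning a hash code into a bucket index can be sketched as below. The function names are made up for illustration, and a 64-bit hash code is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Works for any table size m > 0.
std::size_t index_by_mod(std::uint64_t hash_code, std::size_t m) {
    assert(m > 0);
    return static_cast<std::size_t>(hash_code % m);
}

// Only valid when m is a power of two (m == 2^k): masking with m - 1
// keeps the low k bits, which is the same as hash_code % m.
std::size_t index_by_mask(std::uint64_t hash_code, std::size_t m) {
    assert(m > 0 && (m & (m - 1)) == 0);
    return static_cast<std::size_t>(hash_code & (m - 1));
}
```

For a power-of-two size both functions agree, e.g. for hash code 9282 and 16 buckets each gives index 2.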
A good compromise is to dynamically change the number of buckets over the lifetime of the hash table. The load factor of a hash table, typically denoted α, is the ratio of the number of elements to the number of buckets. Most hash tables pick some maximum load factor. Once the load factor crosses this limit, the hash table increases its number of slots (say, by doubling), then redistributes the elements from the old table into the new one. This is called rehashing. Assuming the maximum load factor in the table is a constant, this ensures that, assuming you have a good hash function, the expected cost of doing a lookup remains O(1). Insertions now have an amortized expected cost of O(1) because of the cost of periodically rebuilding the table, as is the case with deletions. (Deletions can similarly compact the table if the load factor gets too small.)
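Here is a rough sketch of load-factor-driven rehashing on top of a chained table. The class name `GrowingSet`, the initial size of 4 buckets, and the maximum load factor of 1.0 are all assumptions picked for illustration; real libraries choose their own thresholds. Shrinking on deletion is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical chained hash set of ints that doubles its bucket count
// whenever an insertion would push the load factor past kMaxLoad.
class GrowingSet {
public:
    GrowingSet() : buckets_(4) {}

    bool insert(int x) {
        if (contains(x)) return false;
        if (static_cast<double>(size_ + 1) / buckets_.size() > kMaxLoad)
            rehash(buckets_.size() * 2);
        buckets_[index_of(x, buckets_.size())].push_back(x);
        ++size_;
        return true;
    }

    bool contains(int x) const {
        for (int y : buckets_[index_of(x, buckets_.size())])
            if (y == x) return true;
        return false;
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_.size(); }
    double load_factor() const { return static_cast<double>(size_) / buckets_.size(); }

private:
    static constexpr double kMaxLoad = 1.0;  // assumed maximum load factor

    static std::size_t index_of(int x, std::size_t m) {
        return std::hash<int>{}(x) % m;
    }

    // Rehashing: allocate the larger bucket array and redistribute every element,
    // since each element's index depends on the number of buckets.
    void rehash(std::size_t new_count) {
        assert(new_count > 0);
        std::vector<std::vector<int>> fresh(new_count);
        for (const std::vector<int>& bucket : buckets_)
            for (int y : bucket)
                fresh[index_of(y, new_count)].push_back(y);
        buckets_ = std::move(fresh);
    }

    std::vector<std::vector<int>> buckets_;
    std::size_t size_ = 0;
};
```

Because the table doubles each time, the total work spent on rebuilding stays proportional to the number of insertions, which is where the amortized O(1) bound comes from.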
In open addressing, when you perform an insertion, as before, you jump to some slot whose index depends on the hash code computed. If that slot is free, great! You put the item there, and you're done. But what if the slot is already full? In that case, you use some secondary strategy to find a different free slot in which to store the item. The most common strategy for doing this uses an approach called linear probing. In linear probing, if the slot you want is already full, you simply shift to the next slot in the table. If that slot is empty, great! You can put the item there. But if that slot is full, you then move to the next slot in the table, etc. (If you hit the end of the table, just wrap back around to the beginning).
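A minimal sketch of linear probing, under some simplifying assumptions: the table has a fixed capacity, there is no deletion (which needs extra care in open addressing), and the class name `LinearProbingSet` is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical fixed-capacity open-addressing set of ints using linear probing.
class LinearProbingSet {
public:
    explicit LinearProbingSet(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
    }

    // Returns false if x is already present or every slot is occupied.
    bool insert(int x) {
        std::size_t i = home(x);
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (!slots_[i].used) {            // free slot: store the item here
                slots_[i].used = true;
                slots_[i].value = x;
                return true;
            }
            if (slots_[i].value == x) return false;
            i = (i + 1) % slots_.size();      // shift to the next slot, wrapping around
        }
        return false;
    }

    bool contains(int x) const {
        std::size_t i = home(x);
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (!slots_[i].used) return false;  // an empty slot ends the probe sequence
            if (slots_[i].value == x) return true;
            i = (i + 1) % slots_.size();
        }
        return false;
    }

private:
    struct Slot {
        bool used;
        int value;
    };

    std::size_t home(int x) const { return std::hash<int>{}(x) % slots_.size(); }

    std::vector<Slot> slots_;  // value-initialized: every slot starts unused
};
```

A real table would combine this with the load-factor-based resizing described earlier so that probe sequences stay short.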
Linear probing is a surprisingly fast way to build a hash table. CPU caches are optimized for locality of reference, so memory lookups in adjacent memory locations tend to be much faster than memory lookups in scattered locations. Since a linear probing insertion or deletion works by hitting some array slot and then walking linearly forward, it results in few cache misses and ends up being a lot faster than what the theory normally predicts. (And it happens to be the case that the theory predicts it's going to be very fast!)
Another strategy that's become popular recently is cuckoo hashing. I like to think of cuckoo hashing as the "Frozen" of hash tables. Instead of having one hash table and one hash function, we have two hash tables and two hash functions. Each item can be in exactly one of two places - it's either in the location in the first table given by the first hash function, or it's in the location in the second table given by the second hash function. This means that lookups are worst-case efficient, since you only have to check two spots to see if something is in the table.
Insertions in cuckoo hashing use a different strategy than before. We start off by seeing if either of the two slots that could hold the item are free. If so, great! We just put the item there. But if that doesn't work, then we pick one of the slots, put the item there, and kick out the item that used to be there. That item has to go somewhere, so we try putting it in the other table at the appropriate slot. If that works, great! If not, we kick an item out of that table and try inserting it into the other table. This process continues until everything comes to rest, or we find ourselves trapped in a cycle. (That latter case is rare, and if it happens we have a bunch of options, like "put it in a secondary hash table" or "choose new hash functions and rebuild the tables.")
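The kick-out loop can be sketched like this. Several things here are assumptions for illustration only: the class name `CuckooSet`, the two toy hash functions, the cap of 32 displacements, and the use of a plain overflow list as a stand-in for the "secondary hash table" fallback mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical cuckoo hash set of 32-bit keys: two tables, two hash functions.
class CuckooSet {
public:
    explicit CuckooSet(std::size_t slots_per_table)
        : t1_(slots_per_table), t2_(slots_per_table) {
        assert(slots_per_table > 0);
    }

    // An item can only live in one slot of each table (or in the overflow list).
    bool contains(std::uint32_t x) const {
        const Slot& a = t1_[h1(x)];
        const Slot& b = t2_[h2(x)];
        if ((a.used && a.value == x) || (b.used && b.value == x)) return true;
        for (std::uint32_t s : stash_)
            if (s == x) return true;
        return false;
    }

    void insert(std::uint32_t x) {
        if (contains(x)) return;
        ++size_;
        // First slot taken but second free: use the second.
        Slot& second = t2_[h2(x)];
        if (t1_[h1(x)].used && !second.used) { second = Slot{true, x}; return; }
        // Otherwise put the item in table 1, kicking out any resident, and
        // keep bouncing evicted items between the two tables.
        std::uint32_t cur = x;
        bool into_first = true;
        for (int kicks = 0; kicks < kMaxKicks; ++kicks) {
            Slot& s = into_first ? t1_[h1(cur)] : t2_[h2(cur)];
            if (!s.used) { s = Slot{true, cur}; return; }
            std::swap(cur, s.value);   // cur takes the slot; old resident is evicted
            into_first = !into_first;  // evicted item tries the other table
        }
        stash_.push_back(cur);         // probably a cycle: fall back to overflow storage
    }

    std::size_t size() const { return size_; }

private:
    struct Slot {
        bool used;
        std::uint32_t value;
    };
    static constexpr int kMaxKicks = 32;  // assumed cap on displacements

    // Toy hash functions, assumed for illustration only.
    std::size_t h1(std::uint32_t x) const { return x % t1_.size(); }
    std::size_t h2(std::uint32_t x) const {
        return static_cast<std::size_t>((x * 2654435761u) >> 15) % t2_.size();
    }

    std::vector<Slot> t1_, t2_;
    std::vector<std::uint32_t> stash_;
    std::size_t size_ = 0;
};
```

The other fallback mentioned above, choosing new hash functions and rebuilding both tables, would replace the `stash_.push_back` line.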
Then there are hybrid approaches. Hopscotch hashing is a mix between open addressing and chained hashing that can be thought of as taking a chained hash table and storing each item in each bucket in a slot near where the item wants to go. This strategy plays well with multithreading. The Swiss table uses the fact that some processors can perform multiple operations in parallel with a single instruction to speed up a linear probing table. Extendible hashing is designed for databases and file systems and uses a mix of a trie and a chained hash table to dynamically increase bucket sizes as individual buckets get loaded. Robin Hood hashing is a variant of linear probing in which items can be moved after being inserted to reduce the variance in how far from home each element can live.
bucket#  bucket content / linked list
[0]      --> "sue"(780) --> null
[1]      null
[2]      --> "fred"(42) --> "bill"(9282) --> "jane"(42) --> null
[3]      --> "mary"(73) --> null
[4]      null
[5]      --> "masayuki"(75) --> "sarwar"(105) --> null
[6]      --> "margaret"(2626) --> null
[7]      null
[8]      --> "bob"(308) --> null
[9]      null
Each of the array entries (indices [0], [1]...) is known as a bucket, and starts a - possibly empty - linked list of values (aka elements, in this example - people's names).
Each value (e.g. "fred" with hash 42) is linked from bucket [hash % number_of_buckets], e.g. 42 % 10 == [2]; % is the modulo operator - the remainder when divided by the number of buckets.
Multiple data values may collide at and be linked from the same bucket, most often because their hash values collide after the modulo operation (e.g. 42 % 10 == [2], and 9282 % 10 == [2]), but occasionally because the hash values are the same (e.g. "fred" and "jane" both shown with hash 42 above).
Most hash tables handle collisions - with slightly reduced performance but no functional confusion - by comparing the full value (here text) of a value being sought or inserted to each value already in the linked list at the hashed-to bucket.
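The collision handling just described can be sketched with the hash values from the diagram supplied by hand rather than computed. The names `Entry` and `NameTable` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

struct Entry {
    std::string name;
    std::size_t hash;  // hash value as shown in the diagram, not computed here
};

// Hypothetical table where the caller supplies each name's hash value.
class NameTable {
public:
    explicit NameTable(std::size_t bucket_count) : buckets_(bucket_count) {
        assert(bucket_count > 0);
    }

    std::size_t bucket_of(std::size_t hash) const { return hash % buckets_.size(); }

    void insert(const std::string& name, std::size_t hash) {
        if (!contains(name, hash))
            buckets_[bucket_of(hash)].push_back(Entry{name, hash});
    }

    bool contains(const std::string& name, std::size_t hash) const {
        for (const Entry& e : buckets_[bucket_of(hash)])
            if (e.name == name) return true;  // compare the full value, not just the hash
        return false;
    }

    std::size_t chain_length(std::size_t bucket) const { return buckets_[bucket].size(); }

private:
    std::vector<std::list<Entry>> buckets_;
};
```

With 10 buckets, "fred" (42), "bill" (9282) and "jane" (42) all end up chained from bucket [2], yet looking up "jane" never returns "fred" because the names are compared.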
Hans gives the actual formula for other load factors in a comment below, but for indicative values: with load factor 1 and a cryptographic strength hash function, 1/e (~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) have one element, 1/(2e) or ~18.4% two elements, 1/(3!e) about 6.1% three elements, 1/(4!e) or ~1.5% four elements, 1/(5!e) ~0.3% have five, etc. The average chain length from non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant time operations.
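These percentages can be reproduced with a short sketch, assuming the hash codes behave like independent uniform random choices so that the number of elements in a bucket is approximately Poisson-distributed with mean equal to the load factor. That modelling assumption and the function names are mine, not taken from the comment referred to above.

```cpp
#include <cassert>
#include <cmath>

// Approximate probability that a bucket holds exactly k elements at load
// factor alpha, under the Poisson assumption: e^(-alpha) * alpha^k / k!.
double bucket_probability(double alpha, unsigned k) {
    double p = std::exp(-alpha);
    for (unsigned i = 1; i <= k; ++i) p *= alpha / i;
    return p;
}

// Average chain length over non-empty buckets only:
// (elements per bucket) / (fraction of buckets that are non-empty).
double mean_nonempty_chain(double alpha) {
    assert(alpha > 0.0);
    return alpha / (1.0 - std::exp(-alpha));
}
```

At load factor 1 this gives 1/e for k = 0 and k = 1, 1/(2e) for k = 2, and an average non-empty chain length of about 1.58, independent of the table's absolute size.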
Given a hash table implementation as described above, we can imagine creating a value type such as `struct Value { string name; int age; };`, and equality comparison and hash functions that only look at the `name` field (ignoring age), and then something wonderful happens: we can store `Value` records like `{"sue", 63}` in the table, then later search for "sue" without knowing her age, find the stored value and recover or even update her age - happy birthday Sue - which interestingly doesn't change the hash value so doesn't require that we move Sue's record to another bucket. When we do this, we're using the hash table as an associative container, aka a map, and the values it stores can be thought of as consisting of a key (the name) and one or more other fields that are still called the value (in my example, just the age). A hash table implementation used as a map is known as a hash map.
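One way to try this out is with `std::unordered_set` and custom hash and equality functors that look only at `name`. The helper names, the `-1` "not found" sentinel, and the choice to mark `age` as `mutable` (so it can be updated through the set's const iterators) are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct Value {
    std::string name;
    mutable int age;  // not part of hashing or equality, so changing it in place is safe
};

struct NameHash {
    std::size_t operator()(const Value& v) const { return std::hash<std::string>{}(v.name); }
};

struct NameEqual {
    bool operator()(const Value& a, const Value& b) const { return a.name == b.name; }
};

using People = std::unordered_set<Value, NameHash, NameEqual>;

// Returns the stored age for `name`, or -1 if nobody by that name is stored.
int age_of(const People& people, const std::string& name) {
    auto it = people.find(Value{name, 0});  // the age in the probe value is ignored
    return it == people.end() ? -1 : it->age;
}

// Increments the stored age; the record stays in the same bucket.
bool have_birthday(const People& people, const std::string& name) {
    auto it = people.find(Value{name, 0});
    if (it == people.end()) return false;
    ++it->age;
    return true;
}
```

In practice `std::unordered_map<std::string, int>` expresses the same key/value split directly; the set-with-custom-functors version just mirrors the description above.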
This is normally orchestrated with maths too complicated for me to grok. I'll mention one easy-to-understand way - not the most scalable or cache friendly but inherently elegant (like encryption with a one-time pad!) - as I think it helps drive home the desirable qualities mentioned above. Say you were hashing 64-bit doubles - you could create 8 tables each of 256 random numbers (code below), then use each 8-bit/1-byte slice of the double's memory representation to index into a different table, XORing the random numbers you look up. With this approach, it's easy to see that a bit (in the binary digit sense) changing anywhere in the double results in a different random number being looked up in one of the tables, and a totally uncorrelated final value.
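// note caveats above: cache unfriendly (SLOW) but strong hashing...
std::size_t random[8][256] = { ...random data... };
// unsigned char rather than std::byte, which can't be used directly as an array index
auto p = reinterpret_cast<const unsigned char*>(&my_double);
std::size_t hash = random[0][p[0]] ^
                   random[1][p[1]] ^
                   ... ^
                   random[7][p[7]];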
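For anyone who wants to run the idea, here is a self-contained sketch of the same table-lookup scheme. The class name `TabulationHash`, the use of `std::mt19937_64` to fill the tables, and the default seed are assumptions made so the example is reproducible; they are not part of the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>

static_assert(sizeof(double) == 8, "this sketch assumes 64-bit doubles");

// Hypothetical helper: 8 tables of 256 random 64-bit numbers, one table per byte.
class TabulationHash {
public:
    explicit TabulationHash(std::uint64_t seed = 12345) {
        std::mt19937_64 rng(seed);
        for (auto& table : tables_)
            for (auto& entry : table)
                entry = rng();
    }

    std::uint64_t operator()(double d) const {
        unsigned char bytes[sizeof d];
        std::memcpy(bytes, &d, sizeof d);  // copy out the memory representation
        std::uint64_t h = 0;
        for (std::size_t i = 0; i < sizeof d; ++i)
            h ^= tables_[i][bytes[i]];     // byte i selects an entry from table i
        return h;
    }

private:
    std::uint64_t tables_[8][256];
};
```

One caveat when hashing doubles by their bytes: 0.0 and -0.0 compare equal but have different representations, so a hash meant to pair with `==` would need to normalize them first.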
Many libraries' hashing functions pass integers through unchanged (known as a trivial or identity hash function); it's the other extreme from the strong hashing described above. An identity hash is extremely collision prone in the worst cases, but the hope is that in the fairly common case of integer keys that tend to be incrementing (perhaps with some gaps), they'll map into successive buckets leaving fewer empty than random hashing leaves (our ~36.8% at load factor 1 mentioned earlier), thereby having fewer collisions and fewer longer linked lists of colliding elements than is achieved by random mappings. It's also great to save the time it takes to generate a strong hash, and if keys are looked up in order they'll be found in buckets nearby in memory, improving cache hits. When the keys don't increment nicely, the hope is they'll be random enough they won't need a strong hash function to totally randomise their placement into buckets.
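Both the best case and the worst case of an identity hash can be seen in a small experiment. `IdentityHash` and `occupied_buckets` are hypothetical names, and the bucket index is assumed to be computed with a plain modulus.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Identity ("trivial") hash: the key is its own hash code.
struct IdentityHash {
    std::size_t operator()(std::uint32_t key) const { return key; }
};

// Counts how many of the m buckets receive at least one key.
std::size_t occupied_buckets(const std::vector<std::uint32_t>& keys, std::size_t m) {
    assert(m > 0);
    std::vector<bool> used(m, false);
    for (std::uint32_t k : keys) used[IdentityHash{}(k) % m] = true;
    std::size_t count = 0;
    for (bool u : used)
        if (u) ++count;
    return count;
}
```

With 100 buckets, the incrementing keys 1000..1099 fill every bucket exactly once, whereas the keys 0, 100, 200, ... all land in bucket [0].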
Now to make this work as a hashtable: Imagine you didn't care about memory usage. Then you can create an array as long as your indexing set (all uint32's, for example). As you add something to the table, you hash its key and look at the array at that index. If there is nothing there, you put your value there. If there is already something there, you add this new entry to a list of things at that address, along with enough information (your original key, or something clever) to find which entry actually belongs to which key.