Python Dictionary Underlying Implementation Principles Explained

In Python, dictionaries are implemented by means of hash tables or say hash tables. Dictionaries are also known as associative arrays, also known as hash arrays and so on. That is, a dictionary is also an array, but the index of the array is the hash value obtained after the keys are processed by the hash function. The purpose of the hash function is to make the keys evenly distributed in the array and can be addressed in memory with O(1) time complexity, thus enabling fast lookups and modifications. The difficulty in the design of hash functions in hash tables is to distribute the data evenly in the hash table so as to minimize hash collisions and conflicts. Since different keys may have the same hash value, i.e., conflicts may occur, advanced hash functions can minimize the number of conflicts.Python does not contain such advanced hash functions, a few important (for dealing with strings and integers) hash functions are a few of the common types.

Typically the exact process of building a hash table is as follows:

Data added:The key through the hash function into an integer number, and then the number of the array length of the remainder of the results as an array of subscripts, the value will be stored in the number of subscripts in the array space.

Data Query:Again use the hash function to convert the key to the corresponding array subscript and locate the array position to get the value.

Hash function is a mapping, so the hash function is set very flexible, as long as the hash function value of any keyword derived from this fall within the allowable range of table length. Essentially look at the hash function can not be made into a one-to-one mapping relationship, its nature is a many-to-one mapping, which also leads to the following concept - hash conflict or hash collision. Hash collision is inevitable, but a good hash function design needs to try to avoid hash collision.

Conflicts are resolved in Python2 using the use-open-address method.

CPython uses pseudo-random probing (pseudo-random probing) hash tables as the underlying data structure for dictionaries. Because of this implementation detail, only hashable objects can be used as dictionary keys. The average event complexity of the three basic operations of the dictionary (add element, get element and delete element) is O(1).

All immutable built-in types in Python are hashable.

Variable types (such as lists, dictionaries, and collections) are just not hashable, and therefore cannot be used as keys for dictionaries.

Common hash collision solutions:

1 Open addressing method (open addressing)

In the open addressing method, all the elements are stored in the hash table, when a hash conflict arises, the next candidate location is calculated by a probe function, if the next selected location is still a conflict, then the probe function is constantly looking down until an empty slot is found to store the element to be inserted.

Open address means that in addition to the address derived from the hash function is available, when a conflict occurs other addresses are also available, common methods of open address ideas are linear detection and then hashing, secondary detection and then hashing, etc., these methods are in the case of the first choice of the solution is occupied.

2 Re-hashing

This method is in order to specify a number of hash functions, each time the query in order to call the hash function, call to the first null when the return does not exist, call to this key to return its value.

3 Chain address method

All the records with the same keyword hash value are stored in the same linear chain table, so as not to take up other hash addresses, the same hash value in a chain table, traversed in order to find.

4 Public overflow areas

The basic idea is that all keywords and the basic table keyword for the same hash value of the record, regardless of their hash address obtained by the hash function, once the conflict occurs, are filled into the overflow table.

5 Filling factor α

In general, the average lookup length of a hash table that handles conflicts in the same way depends on the hash table's filling factor. The filling factor of a hash table is defined as the ratio of the number of records filled in the table to the length of the hash table, which signifies the degree to which the hash table is filled. Intuitively, it seems that the smaller α is, the less likely it is that a conflict will occur, and vice versa. Generally 0.75 is more appropriate and involves mathematical derivation.

A key-value in python is an entry.

entry has three states.

Unused： me_key == me_value == NULL

Unused is the initial state of the entry, with key and value both NULL. when an element is inserted, the Unused state is converted to Active. This is the only case where me_key is NULL.

Active： me_key ! = NULL and me_key ! = dummy and me_value ! = NULL

After inserting an element, entry becomes Active, which is the only case where me_value is not NULL, and Active is converted to Dummy when the element is deleted.

Dummy： me_key == dummy and me_value == NULL

The dummy object here is actually a PyStringObject object used only as an indicator. an element in the Dummy state can be made Active when an element is inserted, but it cannot be made Unused again.

Why entries have Dummy state? This is because the use of open addressing method, when the hash conflict will find the next appropriate location, for example, an element should be inserted into A after hash calculation, but at this time there is an element in A, through the probe function to get the next location B, there are still elements, until the location of the C until the time to find C. At this point, the ABC constitutes the probe chain, look for the element if the same value of the hash, then it is also along the chain of detection. Detection chain is constantly looking back, when the deletion of an element in the detection chain, such as B, if the direct removal of B from the hash table, that is, into the Unused state, then it is impossible to find the C, because there is a break between the AC phenomenon, it is so that the third state - Dummy, Dummy is a similar pseudo-deletion, to ensure that the detection of the chain's continuity.

draw attention to sth.

In general the ordinary sequential table array storage structure can also be considered as a simple hash table, although the hash function is not used (take the balance), but the same can be found and modified in O (1) time. But there are two problems with this approach: poor scalability; waste of space.

set set and dict is also based on the same hash table, only his table element only contains a reference to the key, and no reference to the value of the other and the dict is basically the same, so here we will not say more. And dict requires that the key must be able to be hashed immutable objects, so the ordinary set can not be used as the key of the dict, you must choose to be "frozen" immutable set class: frozenset. As the name suggests, once initialized, the set of data can not be modified.

Above this Python dictionary underlying implementation principle in detail is all I have shared with you, I hope to give you a reference, and I hope you support me more.