In-depth understanding of the Python virtual machine in the dictionary (dict) implementation principles and source code dissection

Dictionary Data Structure Analysis

/* The ma_values pointer is NULL for a combined table
 * or points to an array of PyObject* for a split table
 */
typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;
 
struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size;
    dict_lookup_func dk_lookup;
    Py_ssize_t dk_usable;
    PyDictKeyEntry dk_entries[1];
};
 
typedef struct {
    /* Cached hash code of me_key. */
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* This field is only meaningful for combined tables */
} PyDictKeyEntry;

The meaning of each field above is:

ob_refcnt, the reference count of the object.
ob_type, the data type of the object.
ma_used, the number of data in the current hash table.
ma_keys, pointing to an array holding key-value pairs.
ma_values, which points to an array of values, is not necessarily used in cpython implementations because the PyDictKeyEntry array in _dictkeysobject can also store value, which may only be used if the keys are all strings, which is what we'll be doing in this article. For the purposes of this article, we're going to use the value in PyDictKeyEntry to discuss the implementation of dictionaries, so you can ignore this variable.
dk_refcnt, this is also used to indicate the reference count, this has to do with the dictionary view, the principle is similar to the reference count, here for the time being regardless.
dk_size, which represents the size of the hash table, must be 2n, in which case the modulo operation can be turned into a bitwise sum operation.
dk_lookup, this represents the hash table lookup function, he is a function pointer.
dk_usable, indicates how many key-value pairs are still available in the current array.
dk_entries, the hash table, where the key-value pairs are actually stored.

The entire hash table is laid out roughly as shown below:

Creating a New Dictionary Object

This function is still relatively simple, first apply for memory space, and then some initialization operations, apply for a hash table used to save key-value pairs.

static PyObject *
dict_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    PyObject *self;
    PyDictObject *d;
 
    assert(type != NULL && type->tp_alloc != NULL);
    // Claim memory space
    self = type->tp_alloc(type, 0);
    if (self == NULL)
        return NULL;
    d = (PyDictObject *)self;
 
    /* The object has been implicitly tracked by tp_alloc */
    if (type == &PyDict_Type)
        _PyObject_GC_UNTRACK(d);
    // Since no data has been added yet, ma_used = 0 in the hash table.
    d->ma_used = 0;
    // Request an array to hold key-value pairs PyDict_MINSIZE_COMBINED is a macro definition A value of 8 indicates the minimum length of the hash table array.
    d->ma_keys = new_keys_object(PyDict_MINSIZE_COMBINED);
    // Returns NULL if the request fails
    if (d->ma_keys == NULL) {
        Py_DECREF(self);
        return NULL;
    }
    return self;
}
 
// The new_keys_object function looks like this
static PyDictKeysObject *new_keys_object(Py_ssize_t size)
{
    PyDictKeysObject *dk;
    Py_ssize_t i;
    PyDictKeyEntry *ep0;
 
    assert(size >= PyDict_MINSIZE_SPLIT);
    assert(IS_POWER_OF_2(size));
    // This is where the memory is requested. The actual size of the requested memory space is the size of PyDictKeysObject plus the size of size-1 PyDictKeyEntry.
    // Here you might wonder why it's not the size of a PyDictKeyEntry, because there's already a PyDictKeyEntry object requested in the PyDictKeysObject structure.
    dk = PyMem_MALLOC(sizeof(PyDictKeysObject) +
                      sizeof(PyDictKeyEntry) * (size-1));
    if (dk == NULL) {
        PyErr_NoMemory();
        return NULL;
    }
    // Here are some initialization operations dk_refcnt is set to 1 because there is only one dictionary object that uses this PyDictKeysObject object.
    DK_DEBUG_INCREF dk->dk_refcnt = 1;
    dk->dk_size = size; // Size of the hash table
    // The following line of code indicates how many key-value pairs can currently be stored in the hash table. cpython's implementation allows 2/3 of the array space to store data, beyond which it needs to be expanded.
    dk->dk_usable = USABLE_FRACTION(size); // #define USABLE_FRACTION(n) ((((n) << 1)+1)/3)
    ep0 = &dk->dk_entries[0];
    /* Hash value of slot 0 is used by popitem, so it must be initialized */
    ep0->me_hash = 0;
    // initialize all key-value pairs to NULL
    for (i = 0; i < size; i++) {
        ep0[i].me_key = NULL;
        ep0[i].me_value = NULL;
    }
    dk->dk_lookup = lookdict_unicode_nodummy;
    return dk;
}

hash table expansion mechanism

First of all, we first understand the implementation of the dictionary in the hash table expansion mechanism, when we continue to add new data to the dictionary, the dictionary will soon reach the data in the array length of 23, this time the need to expand the expansion of the array after the expansion of the size of the array is calculated as follows:

#define GROWTH_RATE(d) (((d)->ma_used*2)+((d)->ma_keys->dk_size>>1))

The size of the new array is equal to the number of key-value pairs multiplied by 2 plus half the length of the original array.

Overall there are three main steps to expanding capacity:

Calculates the size of the new array.
Creates a new array.
Add the data from the original hash table to the new array (that is, the process of re-hashing).

The specific implementation code is shown below:

static int
insertion_resize(PyDictObject *mp)
{
    return dictresize(mp, GROWTH_RATE(mp));
}
 
static int
dictresize(PyDictObject *mp, Py_ssize_t minused)
{
    Py_ssize_t newsize;
    PyDictKeysObject *oldkeys;
    PyObject **oldvalues;
    Py_ssize_t i, oldsize;
    // The main purpose of the following code is to compute the power of an integer that can be greater than or equal to the smallest minused power of two.
/* Find the smallest table size > minused. */
    for (newsize = PyDict_MINSIZE_COMBINED;
         newsize <= minused && newsize > 0;
         newsize <<= 1)
        ;
    if (newsize <= 0) {
        PyErr_NoMemory();
        return -1;
    }
    oldkeys = mp->ma_keys;
    oldvalues = mp->ma_values;
    /* Allocate a new table. */
   // Create a new array
    mp->ma_keys = new_keys_object(newsize);
    if (mp->ma_keys == NULL) {
        mp->ma_keys = oldkeys;
        return -1;
    }
    if (oldkeys->dk_lookup == lookdict)
        mp->ma_keys->dk_lookup = lookdict;
    oldsize = DK_SIZE(oldkeys);
    mp->ma_values = NULL;
    /* If empty then nothing to copy so just return */
    if (oldsize == 1) {
        assert(oldkeys == Py_EMPTY_KEYS);
        DK_DECREF(oldkeys);
        return 0;
    }
    /* Main loop below assumes we can transfer refcount to new keys
     * and that value is stored in me_value.
     * Increment ref-counts and copy values here to compensate
     * This (resizing a split table) should be relatively rare */
    if (oldvalues != NULL) {
        for (i = 0; i < oldsize; i++) {
            if (oldvalues[i] != NULL) {
                Py_INCREF(oldkeys->dk_entries[i].me_key);
                oldkeys->dk_entries[i].me_value = oldvalues[i];
            }
        }
    }
    /* Main loop */
    // Add the elements of the original array to the new one.
    for (i = 0; i < oldsize; i++) {
        PyDictKeyEntry *ep = &oldkeys->dk_entries[i];
        if (ep->me_value != NULL) {
            assert(ep->me_key != dummy);
            insertdict_clean(mp, ep->me_key, ep->me_hash, ep->me_value);
        }
    }
    // update how much data can be inserted into the current hash table
    mp->ma_keys->dk_usable -= mp->ma_used;
    if (oldvalues != NULL) {
        /* NULL out me_value slot in oldkeys, in case it was shared */
        for (i = 0; i < oldsize; i++)
            oldkeys->dk_entries[i].me_value = NULL;
        assert(oldvalues != empty_values);
        free_values(oldvalues);
        DK_DECREF(oldkeys);
    }
    else {
        assert(oldkeys->dk_lookup != lookdict_split);
        if (oldkeys->dk_lookup != lookdict_unicode_nodummy) {
            PyDictKeyEntry *ep0 = &oldkeys->dk_entries[0];
            for (i = 0; i < oldsize; i++) {
                if (ep0[i].me_key == dummy)
                    Py_DECREF(dummy);
            }
        }
        assert(oldkeys->dk_refcnt == 1);
        DK_DEBUG_DECREF PyMem_FREE(oldkeys);
    }
    return 0;
}

Dictionary insertion data

We are constantly inserting data into the dictionary is likely to encounter hash conflicts, the dictionary to deal with hash conflicts and the set of basic hash conflicts encountered in the processing method is almost the same, are the use of the development of the address method, but the realization of this open address method is relatively more complex, the specific program is shown below:

static void
insertdict_clean(PyDictObject *mp, PyObject *key, Py_hash_t hash,
                 PyObject *value)
{
    size_t i;
    size_t perturb;
    PyDictKeysObject *k = mp->ma_keys;
    // First get the value of the mask
    size_t mask = (size_t)DK_SIZE(k)-1;
    PyDictKeyEntry *ep0 = &k->dk_entries[0];
    PyDictKeyEntry *ep;
  
    i = hash & mask;
    ep = &ep0[i];
    for (perturb = hash; ep->me_key != NULL; perturb >>= PERTURB_SHIFT) {
        // Here's what to do in case of hash conflicts
        i = (i << 2) + i + perturb + 1;
        ep = &ep0[i & mask];
    }
    assert(ep->me_value == NULL);
    ep->me_key = key;
    ep->me_hash = hash;
    ep->me_value = value;
}

summarize

In this article mainly gives you a brief introduction to the implementation mechanism of the dictionary within cpython, in general, the entire dictionary implementation mechanism is still quite complex, this article only talks about a small part of the entire dictionary implementation, the main design of the dictionary's memory layout and the related data structures, the most important dictionary creation expansion and insertion, which is still very helpful for us to understand the structure of the hash table! very helpful!!!

The above is an in-depth understanding of the Python virtual machine dictionary (dict) implementation principles and source code analysis of the details, more information about the Python virtual machine dictionary please pay attention to my other related articles!