
Learning Python's built-in type str from the source code

Introduction

This section of "Getting to Know Python's Built-In Types" introduces you to a variety of common built-in types in Python from a source code perspective.

Before turning to the commonly used str type, a previous post in this series, "In-depth understanding of Python built-in types: bytes", covered bytes, which is closely related to str. This post looks back at the source code behind str.

1 Unicode

The basic unit of computer storage is the byte, which consists of 8 bits. Since English uses only 26 letters plus a handful of symbols, English characters can be stored directly in single bytes. Other languages (CJK and others), however, have far more characters and must be encoded with multiple bytes per character.

As computing spread, encoding schemes for non-Latin characters kept evolving, but two significant limitations remained:

  • No multilingual support: an encoding scheme designed for one language cannot be used for another.
  • No unified standard: Chinese alone, for example, has GBK, GB2312, GB18030 and other encoding standards.

Because of these inconsistencies, developers had to convert back and forth between encodings, which inevitably led to many errors. To address this, the Unicode standard was created: it organizes and assigns code points to most of the world's writing systems so that computers can handle text in a uniform way. Unicode currently contains over 140,000 characters and is inherently multilingual. (The "uni" in Unicode is the root meaning "unification".)
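For example, every character, regardless of language, maps to exactly one code point, which Python exposes through ord():

>>> for ch in 'AδΈ­πŸ˜„':
...     print(ch, hex(ord(ch)))
...
A 0x41
δΈ­ 0x4e2d
πŸ˜„ 0x1f604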

2 Unicode in Python

2.1 Benefits of Unicode Objects

Since Python 3, str objects are represented internally as Unicode, and are therefore called Unicode objects in the source code. The advantage is that a program's core logic can work uniformly on Unicode and only needs to decode and encode at the input and output layers, which minimizes all kinds of encoding problems.

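A minimal sketch of this "decode at the input layer, work on str in the core, encode at the output layer" flow (the data and encodings here are just for illustration):

>>> raw = 'δ½ ε₯½, world'.encode('utf-8')   # bytes arriving from the outside (file, socket, ...)
>>> text = raw.decode('utf-8')            # input layer: bytes -> str
>>> processed = text.upper()              # core logic only ever deals with str
>>> processed.encode('utf-8')             # output layer: str -> bytes
b'\xe4\xbd\xa0\xe5\xa5\xbd, WORLD'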

2.2 Python's Optimization of Unicode

Problem: Unicode contains more than 140,000 characters, so each character needs 4 bytes to store (2 bytes are not enough, and 3-byte units are not normally used). English characters, however, need only 1 byte in ASCII, so a fixed 4-byte representation would quadruple the cost of the most frequently used English text.

First let's look at the size differences between the different forms of str objects in Python:

>>> import sys
>>> sys.getsizeof('ab') - sys.getsizeof('a')
1
>>> sys.getsizeof('δΈ€δΊŒ') - sys.getsizeof('δΈ€')
2
>>> sys.getsizeof('πŸ˜„πŸ˜„') - sys.getsizeof('πŸ˜„')
4

As you can see, Python internally optimizes Unicode objects: depending on the text content, the underlying storage unit is selected.

The underlying storage of a Unicode object falls into one of three categories, based on the range of the code points in the text:

  • PyUnicode_1BYTE_KIND: all characters have code points between U+0000 and U+00FF.
  • PyUnicode_2BYTE_KIND: all characters have code points between U+0000 and U+FFFF, and at least one character has a code point greater than U+00FF.
  • PyUnicode_4BYTE_KIND: all characters have code points between U+0000 and U+10FFFF, and at least one character has a code point greater than U+FFFF.

The corresponding enumerations are as follows:

enum PyUnicode_Kind {
/* String contains only wstr byte characters.  This is only possible
   when the string was created with a legacy API and _PyUnicode_Ready()
   has not been called yet.  */
    PyUnicode_WCHAR_KIND = 0,
/* Return values of the PyUnicode_KIND() macro: */
    PyUnicode_1BYTE_KIND = 1,
    PyUnicode_2BYTE_KIND = 2,
    PyUnicode_4BYTE_KIND = 4
};

Depending on the classification, different storage units are selected:

/* Py_UCS4 and Py_UCS2 are typedefs for the respective
   unicode representations. */
typedef uint32_t Py_UCS4;
typedef uint16_t Py_UCS2;
typedef uint8_t Py_UCS1;

The correspondence is as follows:

Text kind              Character storage unit (CSU)   CSU size (bytes)
PyUnicode_1BYTE_KIND   Py_UCS1                        1
PyUnicode_2BYTE_KIND   Py_UCS2                        2
PyUnicode_4BYTE_KIND   Py_UCS4                        4

Because the internal storage structure of a Unicode object varies with the text type, the kind must be recorded as a public field of the object. Python defines several flags that are kept as public fields (not every field is covered in detail below; feel free to dig further on your own. Hugs~):

  • interned: whether the string is maintained by the interned mechanism
  • kind: the type flag distinguishing the size of the underlying character storage unit
  • compact: the memory allocation mode, i.e. whether the object and the text buffer are allocated separately
  • ascii: whether the text is pure ASCII

The PyUnicode_New function initializes a Unicode object from the number of characters (size) and the largest code point in the text (maxchar). Based mainly on maxchar, it picks the most compact character storage unit and underlying structure for the object (the source is fairly long and not listed here; the result is summarized in the table below):

maxchar range                    kind                   ascii   CSU size (bytes)   Underlying structure
maxchar < 128                    PyUnicode_1BYTE_KIND   1       1                  PyASCIIObject
128 <= maxchar < 256             PyUnicode_1BYTE_KIND   0       1                  PyCompactUnicodeObject
256 <= maxchar < 65536           PyUnicode_2BYTE_KIND   0       2                  PyCompactUnicodeObject
65536 <= maxchar < MAX_UNICODE   PyUnicode_4BYTE_KIND   0       4                  PyCompactUnicodeObject
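The effect of this choice can be observed from Python. The following is a small sketch (the helper char_unit_size is a name made up here): appending one more copy of the last character grows the string by exactly one storage unit, so the difference in sys.getsizeof reveals the unit size.

>>> import sys
>>> def char_unit_size(s):
...     # one extra copy of the last character adds exactly one storage unit
...     return sys.getsizeof(s + s[-1]) - sys.getsizeof(s)
...
>>> char_unit_size('abc')     # all code points < 256
1
>>> char_unit_size('abcδΈ€')   # largest code point < 65536
2
>>> char_unit_size('abcπŸ˜„')   # largest code point >= 65536
4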

3 Underlying Structures for Unicode Objects

3.1 PyASCIIObject

C source code:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
        unsigned int :24;
    } state;
    wchar_t *wstr;              /* wchar_t representation (null-terminated) */
} PyASCIIObject;

Source Code Analysis:

length: the number of characters (code points) in the text

hash: the cached hash of the text (-1 if not yet computed)

state: the flag bits introduced above (interned, kind, compact, ascii, ready)

wstr: a wchar_t pointer to a cached, null-terminated C wide string. (Another article describes the ASCII text as being stored immediately after the PyASCIIObject structure; for a compact ASCII object both statements hold: the character data does follow the structure directly, while wstr merely caches the wide-character form.)


(Note that there is a 4-byte hole after the state field, caused by structure field alignment, which trades a little memory for faster memory access.)

Take the ASCII string 'abc' and the empty string '' as examples.
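A quick way to see this layout at work is sys.getsizeof; the exact numbers below assume a typical 64-bit CPython build:

>>> import sys
>>> sys.getsizeof('')     # just the PyASCIIObject header plus the terminating '\0'
49
>>> sys.getsizeof('abc')  # header plus one byte per ASCII character
52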

3.2 PyCompactUnicodeObject

If the text is not pure ASCII, the Unicode object is stored using the PyCompactUnicodeObject structure. The C source code is as follows:

/* Non-ASCII strings allocated through PyUnicode_New use the
   PyCompactUnicodeObject structure. state.compact is set, and the data
   immediately follow the structure. */
typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr, possible
                                 * surrogates count as two code points. */
} PyCompactUnicodeObject;

PyCompactUnicodeObject adds 3 fields to PyASCIIObject:

utf8_length: the length of the text's UTF-8 encoding, excluding the terminating \0

utf8: the UTF-8 encoded form of the text, cached to avoid repeated encoding

wstr_length: the number of code points in wstr (surrogate pairs count as two code points, as the comment in the source notes)

Note that PyASCIIObject does not keep a separate UTF-8 form: ASCII text is already valid UTF-8, so the ASCII data serves as its own UTF-8 representation. This is also why pure ASCII text can use the smaller PyASCIIObject structure.
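The difference in header size between the two structures is also visible with sys.getsizeof. Both strings below contain four 1-byte characters, but the non-ASCII one pays for the larger PyCompactUnicodeObject header (numbers again assume a 64-bit build):

>>> import sys
>>> sys.getsizeof('cafe')   # PyASCIIObject: 48-byte header + 4 chars + '\0'
53
>>> sys.getsizeof('cafΓ©')   # PyCompactUnicodeObject: 72-byte header + 4 chars + '\0'
77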


3.3 PyUnicodeObject

PyUnicodeObject is the full, concrete implementation of the str object in Python. The C source code is as follows:

/* Strings allocated through PyUnicode_FromUnicode(NULL, len) use the
   PyUnicodeObject structure. The actual string data is initially in the wstr
   block, and copied into the data block using _PyUnicode_Ready. */
typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;
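The state flags described above can even be peeked at from Python with ctypes, by overlaying a matching structure on the object's address. The sketch below is only an illustration: it relies on CPython implementation details (a 64-bit, non-debug build whose field layout matches the structs shown above) and is not a supported API.

import ctypes

class _PyASCIIObject(ctypes.Structure):
    # Mirrors the PyASCIIObject layout shown above (assumed 64-bit, non-debug build)
    _fields_ = [
        ('ob_refcnt', ctypes.c_ssize_t),   # PyObject_HEAD
        ('ob_type', ctypes.c_void_p),
        ('length', ctypes.c_ssize_t),
        ('hash', ctypes.c_ssize_t),
        ('interned', ctypes.c_uint, 2),    # the state bit field
        ('kind', ctypes.c_uint, 3),
        ('compact', ctypes.c_uint, 1),
        ('ascii', ctypes.c_uint, 1),
        ('ready', ctypes.c_uint, 1),
    ]

def state_of(s):
    obj = _PyASCIIObject.from_address(id(s))  # id() is the object's address in CPython
    return obj.kind, obj.ascii, obj.compact

print(state_of('abc'))   # (1, 1, 1): 1-byte kind, ASCII, compact
print(state_of('cafΓ©'))  # (1, 0, 1): 1-byte kind, non-ASCII
print(state_of('πŸ˜„'))    # (4, 0, 1): 4-byte kind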

3.4 Examples

In day-to-day development, be aware of how much a string's memory footprint can change after concatenation:

>>> import sys
>>> text = 'a' * 1000
>>> sys.getsizeof(text)
1049
>>> text += 'πŸ˜„'
>>> sys.getsizeof(text)
4080

Appending a single emoji forces every character of the string into 4-byte storage units, roughly quadrupling its size.

4 interned mechanism

If the interned flag of a str object is non-zero, the Python VM maintains the string through the interning mechanism.

The source code is as follows (there are many differing claims and explanations of interning online; the discussion here sticks to what the source itself shows):

/* This dictionary holds all interned unicode strings.  Note that references
   to strings in this dictionary are *not* counted in the string's ob_refcnt.
   When the interned string reaches a refcnt of 0 the string deallocation
   function will delete the reference from this dictionary.
   Another way to look at this is that to say that the actual reference
   count of a string is:  s->ob_refcnt + (s->state ? 2 : 0)
*/
static PyObject *interned = NULL;
void
PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    PyObject *t;
#ifdef Py_DEBUG
    assert(s != NULL);
    assert(_PyUnicode_CHECK(s));
#else
    if (s == NULL || !PyUnicode_Check(s))
        return;
#endif
    /* If it's a subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    Py_ALLOW_RECURSION
    t = PyDict_SetDefault(interned, s, s);
    Py_END_ALLOW_RECURSION
    if (t == NULL) {
        PyErr_Clear();
        return;
    }
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}

As you can see, the function first performs some basic checks. The key part is the call t = PyDict_SetDefault(interned, s, s): s is added to the interned dictionary as both key and value, so that looking up an equal string later returns the canonical object. That call raises the reference count of s by 2 (see the source of PyDict_SetDefault() for details), but references held by the interned dictionary are deliberately not counted, so the code compensates with Py_REFCNT(s) -= 2 at the end to keep the reference count correct.

Consider the following scenario:

>>> class User:
...     def __init__(self, name, age):
...         self.name = name
...         self.age = age
...
>>> user = User('Tom', 21)
>>> user.__dict__
{'name': 'Tom', 'age': 21}

Since an object's attributes are stored in a dict keyed by attribute name, every User instance would otherwise hold its own str object 'name', wasting a lot of memory. Because str is immutable, Python internally turns strings that are likely to be duplicated into shared singletons; this is the interned mechanism. Concretely, CPython maintains a global dict that holds every str object with interning enabled. When such a string is needed, it is created first; if an equal string is already maintained in the dictionary, the existing object is reused and the newly created one is discarded.

Example:

Producing 'abc' through different operations always yields the same object:

>>> a = 'abc'
>>> b = 'ab' + 'c'
>>> id(a), id(b), a is b
(2752416949872, 2752416949872, True)
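Note that not every string is interned automatically. Strings built at run time normally are not, but sys.intern() can add them to the interned dictionary explicitly (a small sketch, not from the original post):

>>> import sys
>>> a = ''.join(['he', 'llo'])   # constructed at run time, not interned automatically
>>> b = ''.join(['he', 'llo'])
>>> a == b, a is b
(True, False)
>>> sys.intern(a) is sys.intern(b)
True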

5 Summary

Personal reflection: while writing this post I consulted a lot of material and ran into many differing claims, which made organizing it somewhat laborious, but I have tried not to pass along inaccurate views and instead based the analysis on the actual source code. The material around str is probably the most muddled of the built-in types so far; the follow-up posts on list and dict should be clearer and more precise, though the biggest limiting factor is still my own ability. There are surely still errors and omissions in this post, but the parts that explain the source code aim to be accurate; I will correct and supplement them as I make progress. Hugs~

This concludes the walkthrough of the source code behind Python's built-in type str. For more about Python's built-in type str, please check out my other related articles!