SoFunction
Updated on 2024-11-19

Python bytecode explained in detail

Repeated concatenation on Python's immutable sequences can be inefficient: each operation creates a new object, and the interpreter has to copy the elements of the original object into it before appending the new ones.
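As a minimal sketch of the point above, the snippet below applies += to a tuple while keeping a second reference to the original. The names t and before are only illustrative.

```python
t = (1, 2)
before = t          # keep a second reference to the original tuple
t += (3,)           # builds a new tuple; the original is left untouched

print(t)            # (1, 2, 3)
print(before)       # (1, 2)
print(t is before)  # False
```

Because the original tuple is unchanged, t has to be rebound to a freshly built object on every +=.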

However, CPython optimizes this case for strings, because += on a str is so common. A str is allocated with some spare room in memory, so an in-place append can often be performed without the copy-and-append step.

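Let's examine the process through bytecode. The session below shows the pre-3.6 bytecode format, in which an instruction with an argument takes three bytes; newer CPython versions produce different output.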

>>> s_code = 'a += "b"'
>>> c = compile(s_code, '', 'exec')
>>> c.co_code
b'e\x00\x00d\x00\x007Z\x00\x00d\x01\x00S'
>>> c.co_names
('a',)
>>> c.co_consts
('b', None)

The bytecode in co_code is a bytes object. A short digression on the bytes type follows.

Bytes type

In b'e\x00\x00d\x00\x007Z\x00\x00d\x01\x00S', the b prefix marks the literal as bytes. A bytes object stores data as a sequence of bytes, and each displayed character stands for one byte (8 bits). For example, the e above stands for binary 0110 0101. A partial ASCII table is shown below.
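A few rows of such a table can be regenerated with a short sketch. The characters chosen here are simply the printable ones that appear in the bytes literal above, and the column layout is an arbitrary choice.

```python
# Character, decimal code, hexadecimal code, binary code.
rows = [(ch, ord(ch), format(ord(ch), '02x'), format(ord(ch), '08b'))
        for ch in 'deSZ7']

for ch, dec, hx, bits in rows:
    print(f'{ch!r:>4} {dec:>4} {hx:>4} {bits}')
```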

However, not every byte is displayable, and some bytes have no ASCII character at all, because ASCII defines only 128 characters while a byte can take 256 values. For example, 0000 0000 is the ASCII NUL character, which is not displayable, and 1000 0000 has no corresponding ASCII code.

To represent these undisplayable bytes, the \x escape is used: it indicates that the next two characters are hexadecimal digits. For example, \x00 means 00 in hexadecimal, or 0000 0000 in binary.

With this, every byte can be represented.
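The following sketch illustrates these rules on a handful of bytes. The sample values are arbitrary, and bytes.isascii() assumes Python 3.7 or later.

```python
data = b'e\x00\x00d'
values = list(data)            # indexing or iterating bytes yields integers
print(values)                  # [101, 0, 0, 100]

# Printable ASCII bytes are shown as characters, the rest as \x escapes.
sample = bytes([0x65, 0x00, 0x7f, 0x80, 0xff])
shown = repr(sample)
print(shown)                   # b'e\x00\x7f\x80\xff'

# 0x7f is still ASCII (a control character); 0x80 and above are not.
print(bytes([0x7f]).isascii())  # True
print(bytes([0x80]).isascii())  # False
```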

Bytecode analysis

Back to the code from the beginning. For readability, convert b'e\x00\x00d\x00\x007Z\x00\x00d\x01\x00S' to hexadecimal.

>>> c.co_code.hex()
'650000640000375a000064010053'

opcode.opname can be used to look up the instruction name that corresponds to an opcode:

>>> import opcode
>>> opcode.opname[0x65]
'LOAD_NAME'

Thus, the complete bytecode can be read as follows (TOS is the top-of-stack element, TOS1 the element below it):

bytes  offset  meaning
65     0       LOAD_NAME
0000           argument: push the value of co_names[0], i.e. the value of a, onto the stack
64     3       LOAD_CONST
0000           argument: push co_consts[0], i.e. 'b', onto the stack
37     6       INPLACE_ADD: TOS = TOS1 + TOS
5a     7       STORE_NAME
0000           argument: co_names[0] = TOS, i.e. assign the top of the stack to a
64     10      LOAD_CONST
0100           argument (little-endian, so index 1): push co_consts[1], i.e. None, onto the stack
53     13      RETURN_VALUE: return TOS to the caller of the function
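The walk-through above can be sketched as a small decoder for this older bytecode layout. This is an illustration, not how dis is implemented: the NAMES table is hand-written from the opcodes in this article, decode_legacy is a hypothetical helper, and the rule that opcodes of 90 and above carry a two-byte little-endian argument applies to the pre-3.6 format only.

```python
raw = bytes.fromhex('650000640000375a000064010053')

# Hand-written subset of the opcode table, taken from the listing above.
NAMES = {
    0x65: 'LOAD_NAME',
    0x64: 'LOAD_CONST',
    0x37: 'INPLACE_ADD',
    0x5a: 'STORE_NAME',
    0x53: 'RETURN_VALUE',
}
HAVE_ARGUMENT = 90  # opcodes >= 90 are followed by a 2-byte argument


def decode_legacy(code):
    """Return (offset, name, argument) tuples for pre-3.6 bytecode."""
    result = []
    i = 0
    while i < len(code):
        op = code[i]
        if op >= HAVE_ARGUMENT:
            arg = code[i + 1] | (code[i + 2] << 8)  # little-endian
            result.append((i, NAMES[op], arg))
            i += 3
        else:
            result.append((i, NAMES[op], None))
            i += 1
    return result


decoded = decode_legacy(raw)
for offset, name, arg in decoded:
    print(offset, name, '' if arg is None else arg)
```

The offsets it prints (0, 3, 6, 7, 10, 13) match the offset column in the listing.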

Readable bytecode can actually be obtained directly with the dis module:

>>> import dis
>>> dis.dis(s_code)
  1           0 LOAD_NAME                0 (a)
              3 LOAD_CONST               0 ('b')
              6 INPLACE_ADD
              7 STORE_NAME               0 (a)
             10 LOAD_CONST               1 (None)
             13 RETURN_VALUE
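On a current interpreter the offsets and some instruction names will differ from the listing above. The sketch below uses dis.get_instructions to collect the same information programmatically. The '<demo>' filename and the variable names are arbitrary, and no expected output is given because it depends on the Python version.

```python
import dis

code = compile('a += "b"', '<demo>', 'exec')

# Each Instruction carries its offset, its opcode name and a readable argument.
instructions = [(ins.offset, ins.opname, ins.argrepr)
                for ins in dis.get_instructions(code)]

for offset, opname, argrepr in instructions:
    print(f'{offset:>4} {opname:<16} {argrepr}')

opnames = [name for _, name, _ in instructions]
```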

Full Code:

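import dis

s_code = 'a += "b"'
c = compile(s_code, '', 'exec')
print(c.co_code)        # raw bytecode as a bytes object
print(c.co_names)       # names referenced by the code
print(c.co_consts)      # constants referenced by the code
print(c.co_code.hex())  # the same bytecode in hexadecimal
dis.dis(s_code)         # readable disassembly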

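Unfortunately, this attempt fails: comparing the augmented-assignment bytecode for a string with that for a tuple does not reveal the string optimization, which suggests that it is applied inside the interpreter rather than in the bytecode.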
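The comparison can be reproduced with a sketch like the following. The helper opnames and the two source strings are illustrative choices, and the exact instruction names depend on the Python version.

```python
import dis


def opnames(source):
    """Return the instruction names that `source` compiles to."""
    code = compile(source, '<demo>', 'exec')
    return [ins.opname for ins in dis.get_instructions(code)]


str_ops = opnames('a += "b"')
tuple_ops = opnames('a += ("b",)')

print(str_ops)
print(tuple_ops)
print(str_ops == tuple_ops)  # whether the two instruction sequences match
```

The compiler does not know the type of a, so it has no basis for emitting a string-specific instruction here.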
