Reading binary byte-stream data in Python (bytes vs. bitstring)

A recent project required reading the contents of a binary file and manipulating the resulting byte stream, mainly by finding and slicing to extract content. Extraction relies on two flags, a start flag and an end flag, with the desired content between them.

Python's built-in bytes type has some useful methods, but they fall short for this task. After some investigation, I learned that bitstring, a third-party package, appears better suited to working with byte-stream data.

bytes

bytes: an immutable sequence of bytes (integers from 0 to 255). Comparing dir(str) with dir(bytes) shows that the two types have very similar attributes and methods, with only a few differences. So, like str, bytes supports the usual sequence operations, such as find, len, split, and slicing.
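
As a small illustration of those str-like operations, here is a minimal sketch on a made-up sample value (the contents of data are an assumption for demonstration only):

```python
# Sample data invented for this illustration.
data = b'ID=42;Start-payload-End;'

# Many str-style operations exist on bytes, but their arguments must be bytes too.
length = len(data)
first_start = data.find(b'Start')   # byte offset of the first match, -1 if absent
fields = data.split(b';')
payload = data[first_start:data.find(b'End')]

# Unlike str, indexing a single position returns an int, not a 1-length bytes object.
first_byte = data[0]

print(length, first_start, fields, payload, first_byte)
```

Note that the arguments to find() and split() are bytes literals; passing a str raises a TypeError.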

The advantage of bytes is that it is a built-in Python type, so no third-party module has to be installed.

But the drawback is also obvious: each search returns only one result, so you cannot get every match in a single call.

The file is first opened with open() in rb mode, and its contents are read as bytes. The find() method locates a specific byte string, but it returns only the index of the first match, and that index is a byte offset (in units of 8 bits), not a bit offset. There is no built-in findall() method for finding multiple matches. To find more than one, the process is cumbersome: find the first match at index 1, search again starting after index 1 to find the second match at index 2, and so on, until find() returns -1.

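# path: location of the binary file, assumed to be defined earlier
with open(path, 'rb') as f:
    datas = f.read()  # Read the whole file as bytes
    start_char = datas.find(b'Start')  # Byte index of the first start flag
    # start_char2 = datas.find(b'Start', start_char + 1)  # Second start flag
    end_char = datas.find(b'End', start_char)  # First end flag after the start flag
    # end_char2 = datas.find(b'End', start_char2)  # End flag for the second start flag
    data = datas[start_char:end_char]  # Includes the start flag itself
    print(data)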

Note that in the code above, the start and end flags may each appear several times in the data, and not necessarily the same number of times. To get the content between each pair of indexes, a single find() call is not enough: the commented-out lines must be repeated once for every additional occurrence. Since we don't know in advance how many start flags the file contains, we don't know how many repetitions are needed. That calls for a loop, but bytes provides no ready-made way to iterate over matches, so the loop and its bookkeeping have to be written by hand. This further complicates the problem.

Secondly, the whole procedure has to be carried out twice, once for the start flags and once for the end flags, because we want the content between the two. This makes the process even more tedious.

A better approach is therefore needed.
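
For comparison, the hand-written loop described above might look like the following. This is only a sketch: find_all and extract_between are hypothetical helper names, not part of Python, and unlike the snippet above this version excludes the flags themselves from the result.

```python
def find_all(data, flag):
    """Return the byte offset of every non-overlapping occurrence of flag."""
    if not flag:
        raise ValueError("flag must not be empty")
    offsets = []
    pos = data.find(flag)
    while pos != -1:                      # find() returns -1 when nothing is left
        offsets.append(pos)
        pos = data.find(flag, pos + len(flag))
    return offsets


def extract_between(data, start_flag, end_flag):
    """Return the content between each start flag and the first end flag after it."""
    sections = []
    for start in find_all(data, start_flag):
        content_start = start + len(start_flag)
        end = data.find(end_flag, content_start)
        if end == -1:                     # this start flag has no matching end flag
            break
        sections.append(data[content_start:end])
    return sections


# Sample data invented for this illustration; the last start flag has no end flag.
sample = b'xxStartAAAEndyyStartBBBEndzzStart'
print(find_all(sample, b'Start'))
print(extract_between(sample, b'Start', b'End'))
```

It works, but the index bookkeeping (advancing past each match, checking for -1) is exactly the tedium described above.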

bitstring

bitstring is a third-party package that can read a binary file as a stream of bits.

The first sentence of its documentation is: This package defines classes that simplify bit-wise creation, manipulation and interpretation of data.

Simply put, it lets you work directly with binary data, such as the contents of a file or a bytes object, down to the level of individual bits.

There are four main classes, as follows:

Bits -- An immutable container for binary data.
BitArray -- A mutable container for binary data.
ConstBitStream -- An immutable container with streaming methods.
BitStream -- A mutable container with streaming methods.

As with bytes, the workflow is: read the contents of the file, look up the indexes of the flags, then slice the data.

# Updated 2022/05/06: corrected the misspelled module name (bistring -> bitstring)
from bitstring import ConstBitStream, BitStream

# path: location of the binary file, assumed to be defined earlier
hex_datas = ConstBitStream(filename=path)  # Read the contents of the file

start_flag = b'Start'
# findall() returns a generator of bit positions, one per match; collect them all at once
start_indexs = list(hex_datas.findall(start_flag, bytealigned=True))

end_flag = b'End'
end_indexs = []
for start_index in start_indexs:
    # find() returns a one-element tuple (bit position,) or an empty tuple if there is no match
    end_pos = hex_datas.find(end_flag, start=start_index, bytealigned=True)
    end_indexs.extend(end_pos)

result = []
for i in range(min(len(start_indexs), len(end_indexs))):
    hex_data = hex_datas[start_indexs[i]:end_indexs[i]]  # Slice by bit position
    str_data = hex_data.tobytes().decode('utf-8')  # Convert to bytes, then decode to str
    result.append(str_data)

Code analysis: first import the two classes, ConstBitStream and BitStream (only ConstBitStream is actually used here), and read the file contents. findall() finds the indexes of all matches, while find() finds only the index of the first match; both report bit positions rather than byte offsets. Take the smaller of the two list lengths (start and end) as the number of pairs and slice the data. Each slice is itself a ConstBitStream, so use the tobytes() method to convert it to bytes. Displayed as raw bytes, Chinese characters would be unreadable, so decode() is used to obtain the desired string.

The whole process is concise and coherent. The code uses the findall(), find(), and tobytes() methods. There are also some small details to watch: for example, if start_indexs is empty, the subsequent code should not be executed, and the same is true if end_indexs is empty.

As you can see, the bitstring package is fairly easy to use. This task needed only a few of its methods; the package offers many others, which can be chosen as needed.
