
Examples of reading large files efficiently with pandas

The first step of any data analysis with pandas is reading the file.

In everyday study and practice, the data volume is usually not very large, so the file-reading step is easy to overlook.

In real-world scenarios, however, it is common to face hundreds of thousands or millions of rows, and even tens of millions or hundreds of millions of rows are still manageable on a single machine.

But as the number of rows and attributes grows, the performance of reading the file starts to become a bottleneck.

When we first get a dataset, we tend to read the file over and over while trying out different ways to analyze it.

If every read means waiting around, it hurts not only productivity but also your mood.

Below I document my own exploration of how to read large files efficiently with pandas.

1. Preparation

First, prepare the data.

The data used in the tests below is trading data for some virtual coins; besides the usual K-line fields, it contains values for many analytical factors.

import pandas as pd

fp = "all_coin_factor_data_12H.csv"
df = pd.read_csv(fp, encoding="gbk")
print(df.shape)


# Running results
(398070, 224)

The dataset has close to 400,000 rows, and each row has 224 attributes.

Next, write a simple decorator to time how long each function takes to run.

from time import time

def timeit(func):
    def func_wrapper(*args, **kwargs):
        start = time()
        ret = func(*args, **kwargs)
        end = time()
        spend = end - start
        print("{} cost time: {:.3f} s".format(func.__name__, spend))
        return ret

    return func_wrapper

2. Normal reading

Let's first see how much time it takes to read data of this size.

The following example reads the file prepared above, all_coin_factor_data_12H.csv, 10 times in a loop.

import pandas as pd

@timeit
def read(fp):
    df = pd.read_csv(
        fp,
        encoding="gbk",
        parse_dates=["time"],
    )
    return df

if __name__ == "__main__":
    fp = "./all_coin_factor_data_12H.csv"
    for i in range(10):
        read(fp)

The results of the run are as follows:

A single read takes about 27 seconds.

3. Compressed reading

The file being read, all_coin_factor_data_12H.csv, is roughly 1.5 GB. pandas can read compressed files directly, so let's see whether compressing it first improves read performance.
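The compression step itself isn't shown here; a minimal sketch of one way to produce the zip file, using the standard-library zipfile module (the compression level is an arbitrary choice), might look like this:

import zipfile

# Write the csv into a zip archive with a single member,
# which is the layout pandas expects for compression="zip".
with zipfile.ZipFile(
    "all_coin_factor_data_12H.zip",
    mode="w",
    compression=zipfile.ZIP_DEFLATED,
    compresslevel=9,
) as zf:
    zf.write("all_coin_factor_data_12H.csv")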

After compression, the file is about 615 MB, less than half its original size.

import pandas as pd

@timeit
def read_zip(fp):
    df = pd.read_csv(
        fp,
        encoding="gbk",
        parse_dates=["time"],
        compression="zip",
    )
    return df

if __name__ == "__main__":
    fp = "./all_coin_factor_data_12H.zip"
    for i in range(10):
        read_zip(fp)

The results of the run are as follows:

A single read takes about 34 seconds, which is actually slower than reading the uncompressed file directly.

4. Batch reading

Next, let's try whether reading in batches improves the speed. Batch reading is intended for particularly large datasets; when a single machine has to process over a hundred million rows, this approach is often used to prevent memory overflow.
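As an aside, when avoiding memory overflow really is the goal, the chunks are usually processed one at a time rather than concatenated back into a single DataFrame. A minimal sketch of that pattern (the "close" column is only a hypothetical example and may not exist in this dataset):

import pandas as pd

def sum_close_by_chunk(fp, chunksize=100_000):
    # Keep only a running aggregate so the whole file
    # never has to sit in memory at once.
    total = 0.0
    reader = pd.read_csv(fp, encoding="gbk", chunksize=chunksize)
    for chunk in reader:
        total += chunk["close"].sum()  # "close" is a hypothetical column name
    return total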

For the performance test, let's first try reading 10,000 rows at a time.

import pandas as pd

@timeit
def read_chunk(fp, chunksize=1000):
    df = pd.DataFrame()
    reader = pd.read_csv(
        fp,
        encoding="gbk",
        parse_dates=["time"],
        chunksize=chunksize,
    )
    for chunk in reader:
        df = pd.concat([df, chunk])

    df = df.reset_index(drop=True)
    return df

if __name__ == "__main__":
    fp = "./all_coin_factor_data_12H.csv"
    for i in range(10):
        read_chunk(fp, 10000)

The results of the run are as follows:

About the same performance as reading a compressed file.

If we adjust it to read 100,000 rows at a time, performance improves slightly.

When reading in batches, the larger each batch is (as long as memory allows), the faster the read.

I also tried reading 1,000 rows at a time; performance was very poor, so I won't include a screenshot here.
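To see this on your own machine, a quick check (reusing the read_chunk function above, with chunk sizes picked arbitrarily) is to time a few different values:

fp = "./all_coin_factor_data_12H.csv"
# Larger chunks mean fewer concatenations per file,
# at the cost of more memory held per chunk.
for chunksize in (1_000, 10_000, 100_000):
    read_chunk(fp, chunksize)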

5. Reading with polars

The previous attempts haven't paid off much, so here I'll introduce a pandas-compatible library: Polars.

Polars is a high-performance DataFrame library, mainly used for manipulating structured data.

It is written in Rust, and its main selling point is high performance.

The DataFrame returned when Polars reads a file is not exactly the same as a pandas DataFrame, but it can be converted with a simple to_pandas call.

Let's look at the file-reading performance of Polars:

import polars as pl

@timeit
def read_pl(fp):
    df = pl.read_csv(
        fp,
        encoding="gbk",
        try_parse_dates=True,
    )
    return df.to_pandas()

if __name__ == "__main__":
    fp = "./all_coin_factor_data_12H.csv"
    for i in range(10):
        read_pl(fp)

The results of the run are as follows:

The performance improvement from Polars is very significant; mixing Polars and pandas looks like a good approach.

6. Reading after serialization

This last method doesn't read the raw data directly; instead, it first converts the raw data to Python's own serialization format (pickle) and then reads that.

This method has an extra conversion step:

fp = "./all_coin_factor_data_12H.csv"
df = read(fp)
df.to_pickle("./all_coin_factor_data_12H.pkl")

This generates a serialized file: all_coin_factor_data_12H.pkl

Then, test the performance of reading this serialized file.

import pandas as pd

@timeit
def read_pkl(fp):
    df = pd.read_pickle(fp)
    return df

if __name__ == "__main__":
    fp = "./all_coin_factor_data_12H.pkl"
    for i in range(10):
        read_pkl(fp)

The results of the run are as follows:

This approach performs surprisingly well, and after serializing the csv file into a pkl file, it also takes up only about half the disk space of the original.

The csv file is around 1.5 GB, while the pkl file is only 690 MB.

Despite its impressive performance, this scheme has some limitations. First, the raw data can't be the kind that changes in real time, because converting the raw csv file to a pkl file itself takes time (the tests above did not count that time).

Second, the serialized pkl file is Python-specific and not a general-purpose format like csv, which makes it inconvenient for non-Python systems to use.

7. Summary

This article explored several ways to optimize how pandas reads large files; the two that worked best in the end were the Polars scheme and the pickle serialization scheme.

If our project analyzes fixed data, such as historical transaction data, historical weather data, or historical sales data, the pickle serialization scheme is well worth considering: spend the time up front to serialize the raw data once, and subsequent analysis no longer wastes time reading the file, so you can try out different analysis ideas more efficiently.
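For that kind of one-off setup, a small helper that serializes the csv the first time and reads the pickle on every later run keeps the workflow simple. A minimal sketch, reusing the same file names as above:

import os
import pandas as pd

def load_data(csv_fp, pkl_fp):
    # Serialize once, then reuse the pickle on every later read.
    if not os.path.exists(pkl_fp):
        df = pd.read_csv(csv_fp, encoding="gbk", parse_dates=["time"])
        df.to_pickle(pkl_fp)
        return df
    return pd.read_pickle(pkl_fp)

df = load_data(
    "./all_coin_factor_data_12H.csv",
    "./all_coin_factor_data_12H.pkl",
)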

In other cases, the Polars scheme is recommended.

One final note: if file-reading performance doesn't really affect you, just keep doing what you were doing; don't optimize for its own sake, and spend your energy on the data analysis itself.

That's all for these examples of reading large files efficiently with pandas. For more on reading large files with pandas, see my other related articles!