When using pandas for data analysis, the first step is to read the file.
In everyday study and practice, the data sets we work with are usually not very large, so the file-reading step is easy to overlook.
In real-world scenarios, however, data volumes of hundreds of thousands or millions of rows are common, and even tens or hundreds of millions of rows can still be handled on a single machine. But as the number of rows and attributes grows, the performance bottleneck of reading files starts to surface.
When we first get a data set, we often read the file over and over while trying different ways to analyze it. If every read means a long wait, it hurts not only productivity but also morale.
Below, I document my own exploration of how to read large files efficiently with pandas.
1. Preparation
First, prepare the data.
The data used in the tests below is trading data for some cryptocurrencies; in addition to the usual K-line data, it contains values for many analytical factors.
import pandas as pd

fp = "all_coin_factor_data_12H.csv"
df = pd.read_csv(fp, encoding="gbk")
print(df.shape)

# Running result:
# (398070, 224)
The total data volume is close to 400,000 rows, and each row has 224 attributes.
Then, write a simple decorator to time how long a function takes to run.
from time import time

def timeit(func):
    def func_wrapper(*args, **kwargs):
        start = time()
        ret = func(*args, **kwargs)
        end = time()
        spend = end - start
        print("{} cost time: {:.3f} s".format(func.__name__, spend))
        return ret
    return func_wrapper
2. Normal reading
Let's first see how much time it takes to read data of this size.
The following example reads the file prepared above (all_coin_factor_data_12H.csv) 10 times in a loop.
import pandas as pd

@timeit
def read(fp):
    df = pd.read_csv(
        fp,
        encoding="gbk",
        parse_dates=["time"],
    )
    return df

if __name__ == "__main__":
    fp = "./all_coin_factor_data_12H.csv"
    for i in range(10):
        read(fp)
The results of the run are as follows:
Each read takes about 27 seconds.
3. Compressed reading
The file being read, all_coin_factor_data_12H.csv, is roughly 1.5GB. pandas can read compressed files directly, so let's see whether read performance improves after compression. The compressed file is about 615MB, less than half the original size.
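The compression step itself is not shown above; as a minimal sketch, one way the zip copy could be produced is with the standard library's zipfile module (the exact tool used originally is not specified, so this is an assumption):

import zipfile

src = "./all_coin_factor_data_12H.csv"
dst = "./all_coin_factor_data_12H.zip"

# Compress the csv into a zip archive; pandas expects the archive to
# contain a single file, so store it under its bare name.
with zipfile.ZipFile(dst, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write(src, arcname="all_coin_factor_data_12H.csv")

With the compressed copy in place, the timed read looks like this: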
import pandas as pd

@timeit
def read_zip(fp):
    df = pd.read_csv(
        fp,
        encoding="gbk",
        parse_dates=["time"],
        compression="zip",
    )
    return df

if __name__ == "__main__":
    fp = "./all_coin_factor_data_12H.zip"
    for i in range(10):
        read_zip(fp)
The results of the run are as follows:
Each read takes about 34 seconds, which is actually slower than reading the uncompressed file, presumably because of the extra decompression work.
4. Chunked reading
Next, let's see whether reading in chunks can improve the speed. Chunked reading is meant for very large data sets; when a single machine has to process more than 100 million rows, this approach is often used to avoid running out of memory.
Let's start by reading 10,000 rows at a time:
import pandas as pd

@timeit
def read_chunk(fp, chunksize=1000):
    df = pd.DataFrame()
    reader = pd.read_csv(
        fp,
        encoding="gbk",
        parse_dates=["time"],
        chunksize=chunksize,
    )
    for chunk in reader:
        df = pd.concat([df, chunk])
    df = df.reset_index()
    return df

if __name__ == "__main__":
    fp = "./all_coin_factor_data_12H.csv"
    for i in range(10):
        read_chunk(fp, 10000)
The results of the run are as follows:
About the same performance as reading a compressed file.
If we adjust it to read 100,000 rows at a time, performance improves slightly.
When reading in chunks, the larger each chunk is (as long as memory allows), the faster the read.
I also tried reading 1,000 rows at a time; the performance was so poor that I won't include a screenshot here.
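Since the real point of chunked reading is to avoid holding the whole data set in memory, here is a minimal sketch of processing each chunk as it arrives instead of concatenating everything; the per-chunk work shown (just counting rows) is only a placeholder, and it assumes the timeit decorator defined earlier is in scope:

import pandas as pd

@timeit
def read_chunk_streaming(fp, chunksize=100000):
    # Handle each chunk as it is read, so memory usage stays roughly
    # constant regardless of the total file size.
    total_rows = 0
    reader = pd.read_csv(
        fp,
        encoding="gbk",
        parse_dates=["time"],
        chunksize=chunksize,
    )
    for chunk in reader:
        total_rows += len(chunk)  # replace with real per-chunk processing
    return total_rows

if __name__ == "__main__":
    fp = "./all_coin_factor_data_12H.csv"
    print(read_chunk_streaming(fp, 100000))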
5. Reading with polars
None of the previous attempts worked particularly well, so let me introduce a pandas-compatible library: Polars.
Polars is a high-performance DataFrame library for working with structured data. It is written in Rust, and its main selling point is performance.
The DataFrame returned when Polars reads a file is not exactly the same as a pandas DataFrame, but it can be converted with a simple to_pandas method.
Here is the file-reading performance with Polars:
import polars as pl

@timeit
def read_pl(fp):
    df = pl.read_csv(
        fp,
        encoding="gbk",
        try_parse_dates=True,
    )
    return df.to_pandas()

if __name__ == "__main__":
    fp = "./all_coin_factor_data_12H.csv"
    for i in range(10):
        read_pl(fp)
The results of the run are as follows:
With Polars the performance improvement is very significant; mixing Polars with pandas looks like a good approach.
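As a rough illustration of what "mixing" them can look like in practice (assuming the read_pl function above is in scope; the indexing step is only an example, using the time column already referenced above):

# Read quickly with Polars, then continue the analysis in pandas as usual.
fp = "./all_coin_factor_data_12H.csv"
df = read_pl(fp)                        # returns a pandas DataFrame
df = df.set_index("time").sort_index()  # plain pandas from here on
print(df.shape)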
6. Reading after serialization
This last method does not read the raw data directly; instead, it first converts the raw data into Python's own serialization format (pickle) and then reads that file.
This method has an extra conversion step:
fp = "./all_coin_factor_data_12H.csv"
df = read(fp)
df.to_pickle("./all_coin_factor_data_12H.pkl")
This generates a serialized file: all_coin_factor_data_12H.pkl.
Then, test the performance of reading this serialized file.
import pandas as pd

@timeit
def read_pkl(fp):
    df = pd.read_pickle(fp)
    return df

if __name__ == "__main__":
    fp = "./all_coin_factor_data_12H.pkl"
    for i in range(10):
        read_pkl(fp)
The results of the run are as follows:
This one performs surprisingly well. After serializing the csv file into a pkl file, it also takes up less than half the disk space: the csv file is about 1.5GB, while the pkl file is only 690MB.
Despite its impressive performance, this approach has some limitations. First, the raw data cannot be the kind that changes in real time, because converting the raw csv file into a pkl file also takes time (the tests above did not count this time; a sketch for measuring it is included at the end of this section).
Second, the serialized pkl file is Python-specific; unlike a csv file it is not a general-purpose format, which makes it hard for non-Python systems to use.
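The one-off conversion cost mentioned above can be measured with the same timeit decorator; a minimal sketch, assuming the read function and timeit decorator defined earlier are in scope:

@timeit
def convert_to_pickle(csv_fp, pkl_fp):
    # One-off cost: read the original csv once and write the pickle file.
    df = read(csv_fp)
    df.to_pickle(pkl_fp)

if __name__ == "__main__":
    convert_to_pickle(
        "./all_coin_factor_data_12H.csv",
        "./all_coin_factor_data_12H.pkl",
    )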
7. Summary
This article explored several ways to optimize reading large files with pandas; the approaches that worked best were the Polars approach and the pickle serialization approach.
If your project analyzes fixed data, such as historical trading data, historical weather data, or historical sales data, consider the pickle serialization approach: spend the time up front to serialize the raw data, and subsequent analysis will not be slowed down by file reads, so you can try out different analysis ideas more efficiently.
In other cases, the Polars approach is recommended.
Finally, if file-reading performance does not really affect you, then keep doing things the way you were; there is no need to over-optimize, and your energy is better spent on the data analysis itself.