Python is a popular programming language and the most popular language in the data science community. Compared with other popular programming languages, Python's main drawback is that its dynamic nature and multifunctional design slow down execution speed: Python code is interpreted at runtime rather than compiled to native code ahead of time.
1. A basic guide to Python's multiprocessing
C code executes roughly 10 to 100 times faster than equivalent Python code. But if you compare development speed, Python is faster than C. For data science research, development speed is far more important than runtime performance, and thanks to its large number of APIs, frameworks, and packages, Python is more popular with data scientists and data analysts. It simply lags far behind in runtime performance.
2. Introduction to multiprocessing
Consider a single-core CPU. If it is assigned more than one task at the same time, it must constantly interrupt the currently executing task and switch to the next one to keep all processes running. On a multicore processor, the CPU can execute multiple tasks simultaneously on different cores, a concept known as parallel processing.
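As a quick check of how much parallelism your machine offers, Python can report the number of available cores; a minimal sketch:

```python
import multiprocessing

# Number of CPU cores the OS exposes; each core can run one
# process truly in parallel.
cores = multiprocessing.cpu_count()
print(f"This machine can run up to {cores} processes in parallel")
```

On a 10-core machine like the one benchmarked later, this prints 10.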
3. Why is it so important?
Data organization, feature engineering, and data exploration are all important elements in the data science model development pipeline. Before being fed into a machine learning model, the raw data needs to be engineered. For smaller datasets, the execution process takes only a few seconds to complete; however, for larger datasets, this task is more burdensome.
Parallel processing is an effective way to improve the performance of a Python program. Python ships with a multiprocessing module that lets us execute a program in parallel across different CPU cores.
4. Implementation
We will use the Pool class from the multiprocessing module, which executes a function in parallel across multiple input values. This concept is called data parallelism, and it is the main purpose of the Pool class.
I'll be using the Quora Question Pairs similarity dataset, downloaded from Kaggle, to demonstrate this module.
The dataset contains a large number of textual questions asked on the Quora platform. I will demonstrate the multiprocessing module on a Python function that processes the text data: it removes stop words, strips HTML tags, removes punctuation, performs stemming, and so on. preprocess() is the function that performs these text-processing steps.
The code snippet of the preprocess() function is hosted on my GitHub.
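The author's actual preprocess() lives in their QuoraTextPreprocessing module on GitHub and is not reproduced here; the sketch below is a hypothetical stand-in illustrating the kinds of steps described (HTML-tag stripping, punctuation and stop-word removal, crude stemming), not the original code:

```python
import re
import string

# Tiny illustrative stop-word list; the real function likely uses
# a full list such as NLTK's.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}

def preprocess(text):
    """Clean one question string: strip HTML tags, punctuation, stop words."""
    text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in STOP_WORDS]
    # Naive suffix-stripping "stemmer" as a placeholder for a real one.
    words = [w[:-1] if w.endswith("s") else w for w in words]
    return " ".join(words)

print(preprocess("<p>What is the best way to learn Python?</p>"))
# → what best way learn python
```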
Now, we use the Pool class from the multiprocessing module to execute the function in parallel on different chunks of the dataset. Each chunk of the dataset will be processed in parallel.
import multiprocessing
from functools import partial
from QuoraTextPreprocessing import preprocess

BUCKET_SIZE = 50000

def run_process(df, start):
    df = df[start:start + BUCKET_SIZE]
    print(start, "to", start + BUCKET_SIZE)
    temp = df["question"].apply(preprocess)

chunks = [x for x in range(0, df.shape[0], BUCKET_SIZE)]
pool = multiprocessing.Pool()
func = partial(run_process, df)
temp = pool.map(func, chunks)
pool.close()
pool.join()
The dataset has 537,361 records (text questions) that need to be processed. With a bucket size of 50,000, the dataset is divided into 11 smaller chunks that can be processed in parallel to speed up the program's execution time.
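The chunk count follows directly from the sizes quoted above: 537,361 rows at a bucket size of 50,000 yields 11 chunks, the last one partial:

```python
import math

n_records = 537_361
BUCKET_SIZE = 50_000

# One chunk per bucket, rounding up so the final partial bucket is included.
n_chunks = math.ceil(n_records / BUCKET_SIZE)
# Starting row index of each chunk, mirroring the range() call in the listing above.
chunk_starts = list(range(0, n_records, BUCKET_SIZE))

print(n_chunks)          # 11
print(chunk_starts[-1])  # 500000: the last chunk covers rows 500000-537360
```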
5. Benchmarking
A question that is often asked is how much faster execution can be with the multiprocessing module. To find out, I executed the preprocess() function once over the entire dataset both with and without data parallelism, and compared the benchmark execution times.
The machine on which the tests were run had 64GB of RAM and 10 CPU cores.
Baseline times for multiprocessing and single-process execution:
From the above figure, we can observe that parallel processing of the Python function increases the execution speed by nearly 30 times.
The Python code used to record the benchmarking data can be found on my GitHub.
Benchmarking process:
Conclusion:
In this article, we discussed using Python's multiprocessing module to accelerate Python function execution. After adding just a few lines of multiprocessing code, the execution time for a dataset with 537k instances became almost 30 times faster.
When dealing with large datasets, I recommend using parallel processing as it saves a lot of time and speeds up the workflow.
This concludes this article on making Python function execution 30 times faster with a few lines of code. For more content on Python function execution, please search my previous articles or browse the related articles below. I hope you will continue to support me in the future!