SoFunction
Updated on 2024-11-15

Python Functions Execute 30x Faster with a Few Lines of Code

Python is a popular programming language and the most popular language in the data science community. Compared with other popular programming languages, Python's main drawback is that its dynamic nature and versatility slow down its performance. Python code is interpreted at runtime rather than compiled to native code ahead of time.

1. A basic guide to Python multiprocessing

C code executes 10 to 100 times faster than Python code. But if you compare development speed, Python is faster than C. For data science research, development speed is far more important than runtime performance. Thanks to its large number of APIs, frameworks, and packages, Python is more popular with data scientists and data analysts; it just lags far behind in performance optimization.

2. Introduction to multi-processing

Consider a single-core CPU: if it is assigned more than one task at the same time, it must constantly interrupt the currently executing task and switch to the next one to keep all processes running. On a multi-core processor, the CPU can execute multiple tasks simultaneously on different cores, a concept known as parallel processing.

3. Why is it so important?

Data organization, feature engineering, and data exploration are all important elements of the data science model-development pipeline. Raw data needs to be engineered before it is fed into a machine learning model. For smaller datasets, this takes only a few seconds to complete; for larger datasets, it becomes a much heavier task.

Parallel processing is an effective way to improve the performance of Python programs. Python ships with a multiprocessing module that allows us to execute a program in parallel across the different cores of the CPU.

4. Realization

We will use the Pool class from the multiprocessing module to execute a function in parallel against multiple input values. This concept is called data parallelism, and it is the main goal of the Pool class.
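As a minimal illustration of this idea (a toy example, not the article's actual code), Pool.map applies a function to each input value in separate worker processes and collects the results in order:

```python
from multiprocessing import Pool

def square(x):
    # A trivial stand-in for a real per-item workload
    return x * x

if __name__ == "__main__":
    # Pool.map distributes the inputs across worker processes
    # and returns the results in the original input order.
    with Pool(processes=4) as pool:
        results = pool.map(square, [1, 2, 3, 4, 5])
    print(results)  # [1, 4, 9, 16, 25]
```

Note that the function passed to Pool.map must be defined at module level so the workers can import it.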

I'll be using the Quora Question Pairs similarity dataset, downloaded from Kaggle, to demonstrate this module.

The above dataset contains a large number of text questions posted on the Quora platform. I will execute a Python function with the multiprocessing module; this function processes the text data by removing stop words, removing HTML tags, removing punctuation, stemming, and other steps.

preprocess() is the function that performs the text-processing steps listed above.

The code snippet for the preprocess() function is hosted on my GitHub.
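The real preprocess() lives in the author's GitHub repository and is not reproduced here; as a rough, hypothetical sketch of what such a function might look like (the stop-word list, tag regex, and suffix-stripping "stemmer" below are all simplified stand-ins):

```python
import re
import string

# A tiny stop-word list for illustration; the real function
# likely uses a fuller list (e.g. from NLTK).
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and"}

def preprocess(text):
    """Clean a question string: strip HTML tags, punctuation,
    stop words, and apply a crude suffix-stripping 'stem'."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in STOPWORDS]
    # Naive stemming: drop common suffixes (a stand-in for a real stemmer)
    words = [re.sub(r"(ing|ed|s)$", "", w) if len(w) > 3 else w for w in words]
    return " ".join(words)

print(preprocess("<p>What is the best way to learn coding?</p>"))
# → what best way learn cod
```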
Now, we use the Pool class from the multiprocessing module to execute the function in parallel on different chunks of the dataset. Each chunk of the dataset will be processed in parallel.

import multiprocessing
from functools import partial

from QuoraTextPreprocessing import preprocess

BUCKET_SIZE = 50000

def run_process(df, start):
    # Process one chunk of the dataframe: rows start .. start + BUCKET_SIZE
    df = df[start:start + BUCKET_SIZE]
    print(start, "to", start + BUCKET_SIZE)
    temp = df["question"].apply(preprocess)

# df is the pandas dataframe holding the Quora questions, loaded earlier
chunks = [x for x in range(0, df.shape[0], BUCKET_SIZE)]
pool = multiprocessing.Pool()
func = partial(run_process, df)  # fix df; pool.map supplies each chunk's start
temp = pool.map(func, chunks)
pool.close()
pool.join()

The dataset has 537,361 records (text questions) that need to be processed. With a bucket size of 50,000, the dataset is divided into 11 smaller chunks that can be processed in parallel to speed up the program's execution time.
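The chunk count follows directly from the arithmetic: stepping through 537,361 rows in strides of 50,000 yields 11 chunks, the last one only partially filled.

```python
import math

TOTAL_ROWS = 537361   # records in the dataset
BUCKET_SIZE = 50000

# Starting offsets of each chunk, as in the snippet above
chunk_starts = list(range(0, TOTAL_ROWS, BUCKET_SIZE))

print(len(chunk_starts))                    # 11
print(math.ceil(TOTAL_ROWS / BUCKET_SIZE))  # 11, same count
print(TOTAL_ROWS - chunk_starts[-1])        # 37361 rows in the final chunk
```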

5. Benchmarking

The question that is often asked is how much faster execution can be with the multiprocessing module. To find out, I executed the preprocess() function once over the entire dataset with data parallelism and once without, and compared the benchmark execution times.

The machine on which the tests were run had 64GB of RAM and 10 CPU cores.

Baseline times for multiprocessing and uniprocessing execution:

From the above figure, we can observe that parallel processing of the Python function increases execution speed by nearly 30 times.

The Python notebook used to record the benchmarking data can be found on my GitHub.

Benchmarking process:
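Since the figure and the real workload are not reproduced here, a minimal sketch of such a benchmarking loop follows; cpu_work() is a hypothetical stand-in for preprocess(), and absolute timings will vary by machine.

```python
import time
from multiprocessing import Pool

def cpu_work(n):
    # CPU-bound stand-in for a real per-chunk workload
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [200_000] * 8

    # Serial baseline
    t0 = time.perf_counter()
    serial = [cpu_work(n) for n in jobs]
    t_serial = time.perf_counter() - t0

    # Parallel run across the available cores
    t0 = time.perf_counter()
    with Pool() as pool:
        parallel = pool.map(cpu_work, jobs)
    t_parallel = time.perf_counter() - t0

    assert serial == parallel  # same results, different wall time
    print(f"serial: {t_serial:.2f}s  parallel: {t_parallel:.2f}s  "
          f"speedup: {t_serial / t_parallel:.1f}x")
```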

Conclusion:

In this article, we discussed how Python's multiprocessing module can be used to speed up the execution of Python functions. After adding a few lines of multiprocessing code, the execution time for a dataset with 537k instances is almost 30 times faster.

When dealing with large datasets, I recommend using parallel processing as it saves a lot of time and speeds up the workflow.
