
A tutorial on Pandarallel, a tool for speeding up Python Pandas computations

As we all know, because of the GIL, all operations in a single Python process run on one CPU core, so to increase speed we usually turn to multiprocessing. And multiprocessing generally means one of the following schemes:

  • multiprocessing
  • concurrent.futures.ProcessPoolExecutor()
  • joblib
  • ppserver
  • celery

None of these schemes is particularly friendly to the average Python user. So what counts as a friendly parallel-processing scheme?

One where I basically don't have to change my original logic, and only need to modify the one line that does the heavy computation. Pandarallel is exactly such a friendly tool.

In the world of pandarallel, you only need to replace the original pandas processing statement to get multi-CPU parallel computing. Very convenient and very nice. For example:
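A minimal sketch of the one-line change involved (assuming a DataFrame df and a row-wise function func already exist):

# Original pandas statement, runs on a single core
res = df.apply(func, axis=1)

# Pandarallel equivalent, spread across worker processes
res = df.parallel_apply(func, axis=1)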

In a performance test on a 4-core CPU (OS: Linux Ubuntu 16.04, Hardware: Intel Core i7 @ 3.40 GHz, 4 cores), it was close to 4x faster than the original statement. In other words, it puts the CPU to good use.

Here is how to use this module. It is actually very simple: adding just a few lines of code to any existing code can achieve a qualitative leap.

1. Preparation

Before you begin, make sure that Python and pip have been successfully installed on your computer, then install pandarallel:

pip install pandarallel

2. Using Pandarallel

Pandarallel needs to be initialized before use:

from pandarallel import pandarallel
pandarallel.initialize()

This makes the parallel computing API available. One important parameter of initialize deserves a mention: nb_workers, which specifies the number of workers used for parallel computing. If it is not set, all available CPU cores will be used.
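As a sketch, initialization with an explicit worker count might look like this (nb_workers and progress_bar are both documented parameters of initialize; the values here are just for illustration):

from pandarallel import pandarallel

# Use 4 worker processes instead of all available cores,
# and display a progress bar for each worker
pandarallel.initialize(nb_workers=4, progress_bar=True)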

Pandarallel supports a total of 8 Pandas operations; here is an example using the apply method.

import pandas as pd
import time
import math
import numpy as np
from pandarallel import pandarallel

# Initialization
pandarallel.initialize()

df_size = int(5e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                       b=np.random.rand(df_size)))

def func(x):
    return math.sin(x.a**2) + math.sin(x.b**2)

# Normal processing
res = df.apply(func, axis=1)

# Parallel processing
res_parallel = df.parallel_apply(func, axis=1)

# See if the results are the same
res.equals(res_parallel)

The other methods are used similarly: just prefix the original method name with parallel_, e.g.:

import pandas as pd
import time
import math
import numpy as np
from pandarallel import pandarallel

# Initialization
pandarallel.initialize()

df_size = int(3e7)
df = pd.DataFrame(dict(a=np.random.randint(1, 1000, df_size),
                       b=np.random.rand(df_size)))

def func(df):
    dum = 0
    for item in df.b:
        dum += math.log10(math.sqrt(math.exp(item**2)))

    return dum / len(df.b)

# Normal processing
res = df.groupby("a").apply(func)

# Parallel processing
res_parallel = df.groupby("a").parallel_apply(func)

res.equals(res_parallel)

Another example :

import pandas as pd
import time
import math
import numpy as np
from pandarallel import pandarallel

# Initialization
pandarallel.initialize()

df_size = int(1e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 300, df_size),
                       b=np.random.rand(df_size)))

def func(x):
    return x.iloc[0] + x.iloc[1] ** 2 + x.iloc[2] ** 3 + x.iloc[3] ** 4

# Normal processing
res = df.groupby('a').b.rolling(4).apply(func, raw=False)

# Parallel processing
res_parallel = df.groupby('a').b.rolling(4).parallel_apply(func, raw=False)

res.equals(res_parallel)

The remaining cases are all similar, so rather than waste your precious time on repetitive examples, I'll just list the table of supported operations here:
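Reconstructed from the pandarallel documentation, the 8 supported operations and their parallel counterparts are:

  • df.apply(func) → df.parallel_apply(func)
  • df.applymap(func) → df.parallel_applymap(func)
  • df.groupby(args).apply(func) → df.groupby(args).parallel_apply(func)
  • df.groupby(args1)[col].rolling(args2).apply(func) → df.groupby(args1)[col].rolling(args2).parallel_apply(func)
  • df.groupby(args1)[col].expanding(args2).apply(func) → df.groupby(args1)[col].expanding(args2).parallel_apply(func)
  • series.map(func) → series.parallel_map(func)
  • series.apply(func) → series.parallel_apply(func)
  • series.rolling(args).apply(func) → series.rolling(args).parallel_apply(func)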

3. Cautions

1. I have 8 CPUs, but parallel_apply only speeds up the computation by about 4 times. Why?

A: As mentioned earlier, each Python worker process occupies one core, so Pandarallel can speed things up by at most the total number of physical cores you have. A 4-core hyper-threaded CPU appears to the operating system as 8 CPUs, but there are really only 4 physical cores, so the speedup tops out at about 4x.
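As a quick sanity check, here is a small sketch for comparing logical and physical core counts (psutil is a third-party package, an extra dependency introduced only for this illustration):

import os
import psutil  # third-party; pip install psutil

print(os.cpu_count())                   # logical CPUs, e.g. 8 with hyper-threading
print(psutil.cpu_count(logical=False))  # physical cores, e.g. 4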

2. Parallelization has a cost (instantiating new processes, sending data through shared memory, etc.), so it only makes sense to parallelize when the amount of computation is large enough. For very small amounts of data, using Pandarallel is not always worth it.
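A rough way to check whether parallelization pays off on your own data is simply to time both versions. A minimal sketch, reusing the df and func from the first example above:

import time

# Time the single-core version
start = time.time()
res = df.apply(func, axis=1)
print("apply:", time.time() - start)

# Time the parallel version; the gap shrinks as the data gets smaller
start = time.time()
res_parallel = df.parallel_apply(func, axis=1)
print("parallel_apply:", time.time() - start)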

That's the end of this tutorial on Pandarallel, a tool for speeding up Python. For more on Pandarallel, please search my previous posts or continue browsing the related articles below, and I hope you will keep supporting me!