
Basic Python implementation of GPU acceleration

Threads and Blocks in CUDA

From the standpoint of computational logic, a GPU can be regarded as a highly parallel computing array. Imagine a two-dimensional grid like a chessboard: each cell can perform a separate task, and all cells can compute at the same time, which is the source of GPU acceleration. In this board analogy, each column corresponds to a thread with its own thread number, and each row corresponds to a block with its own block number. The following simple programs illustrate this logic:

Printing thread numbers with the GPU

# numba_cuda_test.py

from numba import cuda

@cuda.jit
def gpu():
    # each thread prints its index within its block
    print ('threadIdx:', cuda.threadIdx.x)

if __name__ == '__main__':
    gpu[2,4]()
threadIdx: 0
threadIdx: 1
threadIdx: 2
threadIdx: 3
threadIdx: 0
threadIdx: 1
threadIdx: 2
threadIdx: 3

Printing block numbers with the GPU

# numba_cuda_test.py

from numba import cuda

@cuda.jit
def gpu():
    # each thread prints the number of the block it belongs to
    print ('blockIdx:', cuda.blockIdx.x)

if __name__ == '__main__':
    gpu[2,4]()
blockIdx: 0
blockIdx: 0
blockIdx: 0
blockIdx: 0
blockIdx: 1
blockIdx: 1
blockIdx: 1
blockIdx: 1

Printing the dimensions of a block with the GPU

# numba_cuda_test.py

from numba import cuda

@cuda.jit
def gpu():
    # blockDim is the number of threads per block
    print ('blockDim:', cuda.blockDim.x)

if __name__ == '__main__':
    gpu[2,4]()
blockDim: 4
blockDim: 4
blockDim: 4
blockDim: 4
blockDim: 4
blockDim: 4
blockDim: 4
blockDim: 4

Printing the dimensions of the grid with the GPU

# numba_cuda_test.py

from numba import cuda

@cuda.jit
def gpu():
    # gridDim is the number of blocks in the grid
    print ('gridDim:', cuda.gridDim.x)

if __name__ == '__main__':
    gpu[2,4]()
gridDim: 2
gridDim: 2
gridDim: 2
gridDim: 2
gridDim: 2
gridDim: 2
gridDim: 2
gridDim: 2

Summary

We can now summarize the GPU grid concept introduced above. In the test cases, we carve a \(2 \times 4\) grid out of the GPU for our computation, where each row is a block, each column is a thread, and every cell executes the computation at the same time (provided there are no logical dependencies between them).
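To tie the four examples together, here is a minimal sketch (an addition, not from the original article) in which every cell of the \(2 \times 4\) grid prints its block number, its thread number, and the flattened global index that combines the two:

# numba_cuda_test.py
# Sketch: each of the 2*4 grid cells reports its own coordinates.

from numba import cuda

@cuda.jit
def gpu():
    # global index = block number * threads per block + thread number
    idx = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    print ('block:', cuda.blockIdx.x, 'thread:', cuda.threadIdx.x, 'global:', idx)

if __name__ == '__main__':
    gpu[2,4]()
    cuda.synchronize()  # wait for the device so all prints are flushed

This global index is the standard way a kernel maps each thread to one element of the data, and it reappears in the array-summation example later in this article.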

Maximum parallelism supported by the GPU

We can test the parallelism of the GPU with a couple of simple programs. Because each cell of the GPU grid can perform a task independently, it is tempting to think we can allocate as many cells as we want. On the machine used here, the maximum grid size is around \(2^{40}\), so if we ask the GPU for a grid of size \(2^{50}\), the program reports an error:

# numba_cuda_test.py

from numba import cuda

@cuda.jit
def gpu():
    pass

if __name__ == '__main__':
    gpu[2**50,1]()
    print ('Running Success!')

The results of the run are as follows:

Traceback (most recent call last):
  File "numba_cuda_test.py", line 10, in <module>
    gpu[2**50,1]()
  File "/home/dechin/.local/lib/python3.7/site-packages/numba/cuda/", line 822, in __call__
  File "/home/dechin/.local/lib/python3.7/site-packages/numba/cuda/", line 966, in call
    (args, griddim, blockdim, stream, sharedmem)
  File "/home/dechin/.local/lib/python3.7/site-packages/numba/cuda/", line 699, in launch
  File "/home/dechin/.local/lib/python3.7/site-packages/numba/cuda/cudadrv/", line 2100, in launch_kernel
  File "/home/dechin/.local/lib/python3.7/site-packages/numba/cuda/cudadrv/", line 300, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/dechin/.local/lib/python3.7/site-packages/numba/cuda/cudadrv/", line 335, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE

And if we assign a grid within the rated size, the program works fine:

# numba_cuda_test.py

from numba import cuda

@cuda.jit
def gpu():
    pass

if __name__ == '__main__':
    gpu[2**30,1]()
    print ('Running Success!')

This time the added print statement is reached:

Running Success!

It is important to note that the allocatable sizes of the two dimensions are not the same; for example, the limit on this machine is a \(2^{30} \times 2^{10}\) grid:

# numba_cuda_test.py

from numba import cuda

@cuda.jit
def gpu():
    pass

if __name__ == '__main__':
    gpu[2**30,2**10]()
    print ('Running Success!')

Again, it is possible to execute successfully as long as it is within the allowed range:

Running Success!
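Rather than probing these limits by trial and error, you can also ask the device directly. Here is a hedged sketch (an addition to the original text; the attribute names assume numba's exposure of the CUDA device attributes, so consult the numba documentation if they differ on your version):

# device_limits.py
# Sketch: query the device's grid and block limits instead of guessing.

from numba import cuda

if __name__ == '__main__':
    device = cuda.get_current_device()
    print ('Max grid dim x:', device.MAX_GRID_DIM_X)
    print ('Max grid dim y:', device.MAX_GRID_DIM_Y)
    print ('Max block dim x:', device.MAX_BLOCK_DIM_X)
    print ('Max threads per block:', device.MAX_THREADS_PER_BLOCK)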

If the machine has more than one GPU, you can also use the select_device function to choose which GPU executes the instructions:

# numba_cuda_test.py

from numba import cuda
cuda.select_device(1)

@cuda.jit
def gpu():
    pass

if __name__ == '__main__':
    gpu[2**30,2**10]()
    print ('Running Success!')

If the two GPUs have the same amount of allocatable space, it will run successfully:

Running Success!
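If you are not sure how many GPUs are visible before calling select_device, numba can enumerate them. A small sketch (an addition, not from the original article):

# list_gpus.py
# Sketch: enumerate the CUDA devices visible to numba.

from numba import cuda

if __name__ == '__main__':
    cuda.detect()  # prints a summary of the CUDA devices found
    print ('Number of GPUs:', len(cuda.gpus))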

GPU acceleration effects

We have often mentioned the term GPU acceleration; the acceleration a GPU delivers derives from its high degree of parallelism. Here we use an array-summation case to illustrate the effect. The result we need is \(b_j = a_j + b_j\), with the sum written back into one of the input arrays to save memory. Of course, this trick is not possible if that array is needed elsewhere. The specific code is as follows:

# gpu_add.py

from numba import cuda
cuda.select_device(1)
import numpy as np
import time

@cuda.jit
def gpu(a, b, DATA_LENGTH):
    # global thread index across the whole grid
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if idx < DATA_LENGTH:
        b[idx] += a[idx]

if __name__ == '__main__':
    np.random.seed(1)
    DATA_EXP_LENGTH = 20
    DATA_DIMENSION = 2**DATA_EXP_LENGTH
    np_time = 0.0
    nb_time = 0.0
    for i in range(100):
        a = np.random.randn(DATA_DIMENSION).astype(np.float32)
        b = np.random.randn(DATA_DIMENSION).astype(np.float32)
        a_cuda = cuda.to_device(a)
        b_cuda = cuda.to_device(b)
        time0 = time.time()
        gpu[DATA_DIMENSION,4](a_cuda,b_cuda,DATA_DIMENSION)
        time1 = time.time()
        c = b_cuda.copy_to_host()
        time2 = time.time()
        d = np.add(a,b)
        time3 = time.time()
        if i == 0:
            print ('The error between numba and numpy is: ', sum(c-d))
            continue
        np_time += time3 - time2
        nb_time += time1 - time0
    print ('The time cost of numba is: {}s'.format(nb_time))
    print ('The time cost of numpy is: {}s'.format(np_time))

It is important to note that this Numba-based GPU implementation of Python uses a just-in-time (JIT) compilation model: the kernel is compiled and optimized at runtime, when the decorated function is first called. In other words, the first execution achieves no acceleration at all, because its runtime includes a long compilation step. From the second run onward, however, no recompilation is needed, and the GPU acceleration shows up. The results are as follows:

$ python3 gpu_add.py
The error between numba and numpy is:  0.0
The time cost of numba is: 0.018711328506469727s
The time cost of numpy is: 0.09502553939819336s

As you can see, even compared with NumPy, one of the most heavily optimized libraries in Python, our GPU-accelerated program achieves about a 5x speedup (in a previous blog post, the speedup reached 1,000x or more for specific computational scenarios), and the degree of customizability is very high.
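One practical consequence of the JIT model is that benchmarks should exclude the first call. A common warm-up pattern, sketched here under the same setup as gpu_add.py (this block is an illustration added here, not part of the original benchmark):

# gpu_warmup.py
# Sketch: exclude Numba's one-off JIT compilation cost from the measurement
# by launching the kernel once before starting the timer.

from numba import cuda
import numpy as np
import time

@cuda.jit
def gpu(a, b, length):
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if idx < length:
        b[idx] += a[idx]

if __name__ == '__main__':
    n = 2**20
    a_cuda = cuda.to_device(np.random.randn(n).astype(np.float32))
    b_cuda = cuda.to_device(np.random.randn(n).astype(np.float32))
    gpu[n,4](a_cuda, b_cuda, n)   # first call: pays the JIT compilation cost
    cuda.synchronize()
    time0 = time.time()
    gpu[n,4](a_cuda, b_cuda, n)   # second call: the compiled kernel is reused
    cuda.synchronize()            # kernel launches are asynchronous
    print ('Kernel time: {}s'.format(time.time() - time0))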

Summary

This article covers some basic concepts and implementations of GPU-accelerated programs in Python using Numba, such as threads and blocks on the GPU, and gives a vector-addition code example to further illustrate the effect of GPU acceleration. Note that because Numba's Python implementation uses just-in-time compilation, the first run takes significantly longer, so "GPU acceleration" generally refers to the runtime from the second call onward. For common industrial and academic scenarios, such as system evolution in molecular dynamics simulations, or parameter optimization in deep learning and quantum computing, where parameters of the same dimensions are computed over many steps, the JIT model combined with the GPU's high degree of parallelism is a very good fit and can play a great role.

This concludes this introduction to the basic operation of GPU acceleration in Python. For more on Python GPU acceleration, please search my earlier articles or continue browsing the related articles below, and I hope you will continue to support me in the future!