As deep learning models grow ever larger, the memory of a single GPU can no longer support efficient training. Distributed training has therefore become an essential technique for accelerating training.
This article analyzes the following three lines of typical PyTorch distributed training code in detail and, along the way, introduces the core concepts and practical methods of distributed training:
local_rank = int(os.environ.get('LOCAL_RANK', -1))   # /docs/stable/elastic/
global_rank = int(os.environ.get('RANK', -1))
world_size = int(os.environ.get('WORLD_SIZE', 1))
1. What is distributed training?
Distributed training refers to splitting the model training process across multiple computing devices (usually multiple GPUs, possibly spread across multiple machines) for collaborative processing. The goals are to accelerate training and to extend model capacity.
Distributed training can be divided into the following modes:
- Data Parallelism: Each GPU processes a different subset of data and synchronizes the gradient.
- Model Parallelism: Split the model into multiple parts and deploy them to different GPUs (a minimal sketch of this idea follows the list).
- Hybrid Parallelism: Combining model parallelism and data parallelism.
- Pipeline Parallelism: Slice the model by layer, and different GPUs process different stages.
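As a rough illustration of the model-parallel idea (data parallelism is covered by the DDP example in sections 4 and 5 below), here is a minimal sketch, assuming a machine with at least two GPUs; the TwoStageModel name and layer sizes are invented for this example:

import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Toy model-parallel example: stage1 lives on cuda:0, stage2 on cuda:1."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(1024, 4096).to('cuda:0')
        self.stage2 = nn.Linear(4096, 10).to('cuda:1')

    def forward(self, x):
        x = torch.relu(self.stage1(x.to('cuda:0')))
        return self.stage2(x.to('cuda:1'))   # move activations to the second GPU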
2. Understand the core concepts of distributed training
1. World Size (global number of processes)
world_size = int(os.environ.get('WORLD_SIZE', 1))
- Meaning: The total number of processes participating in the distributed training job. Usually equal to the total number of GPUs.
- Effect: Used to initialize the process group (dist.init_process_group()) so that each process knows the size of the cluster.
For example, if you have two machines, each with 4 GPUs, then world_size = 8.
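For reference, once the process group has been initialized (see section 4 below), the same value can also be queried at runtime instead of re-reading the environment variable; a minimal sketch:

import torch.distributed as dist

# Only valid after dist.init_process_group() has been called.
if dist.is_initialized():
    print(f"world_size = {dist.get_world_size()}")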
2. Rank (global process number)
global_rank = int(os.environ.get('RANK', -1))
- Meaning: The unique number of the current process among all training processes (starting from 0).
- Effect: Commonly used to mark the master node (rank == 0) and to control log output, model saving, etc. (a small helper for this check is sketched after the example below).
For example:
- rank=0: Responsible for printing logs and saving models
- rank=1,2,…: Only do training
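A common way to express this split in code is a small rank check; the is_main_process helper below is a sketch for illustration, not a PyTorch API:

import torch.distributed as dist

def is_main_process():
    # Treat rank 0 as the master; fall back to True when the script
    # is run without distributed initialization.
    return (not dist.is_initialized()) or dist.get_rank() == 0

if is_main_process():
    print("rank 0: writing logs and saving checkpoints")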
3. Local Rank (local process number)
local_rank = int(os.environ.get('LOCAL_RANK', -1))
- Meaning: The GPU index of the current training process on the local machine. Generally used in conjunction with CUDA_VISIBLE_DEVICES.
- Effect: Used to specify which GPU the process should use, for example:
torch.cuda.set_device(local_rank)
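Putting the two together, a typical device-binding snippet looks like the following sketch, which also handles the non-distributed fallback implied by the -1 default:

import os
import torch

# Bind this process to its local GPU; fall back to CPU when the script
# is launched without a distributed launcher (LOCAL_RANK unset).
local_rank = int(os.environ.get('LOCAL_RANK', -1))
if local_rank >= 0:
    torch.cuda.set_device(local_rank)
    device = torch.device('cuda', local_rank)
else:
    device = torch.device('cpu')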
3. How to set environment variables
These environment variables are usually set by the distributed launcher. For example, using torchrun:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
    --master_addr=192.168.1.1 --master_port=12345
torchrun will automatically set the following variables for each process:
- LOCAL_RANK
- RANK
- WORLD_SIZE
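A quick way to confirm this from inside the training script is to print the injected variables per process; a small sketch:

import os

# Each rank prints its own values as set by the launcher.
for key in ('LOCAL_RANK', 'RANK', 'WORLD_SIZE'):
    print(f"{key} = {os.environ.get(key, '<not set>')}")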
You can also export manually:
export WORLD_SIZE=8
export RANK=3
export LOCAL_RANK=3
4. Distributed training initialization process (PyTorch example)
In PyTorch, a typical initialization process is as follows:
import os
import torch
import torch.distributed as dist

def setup_distributed():
    local_rank = int(os.environ.get('LOCAL_RANK', -1))
    global_rank = int(os.environ.get('RANK', -1))
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend='nccl',        # use nccl for GPU training, gloo for CPU
        init_method='env://',
        world_size=world_size,
        rank=global_rank
    )
    return local_rank
- init_method='env://': indicates that initialization information is read from environment variables.
- nccl: NVIDIA's high-performance communication library, which supports high-speed communication between GPUs.
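Not shown above, but usually paired with this initialization: tearing down the process group when training finishes. A minimal sketch:

import torch.distributed as dist

# Counterpart to init_process_group(): synchronize all ranks and release
# communication resources at the end of training.
dist.barrier()
dist.destroy_process_group()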
5. The code structure of distributed training
The basic framework for implementing distributed training using PyTorch:
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel

def train():
    local_rank = setup_distributed()
    model = MyModel().cuda()
    model = DistributedDataParallel(model, device_ids=[local_rank])
    dataset = MyDataset()
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=64)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)   # reshuffle the data differently each epoch
        for batch in dataloader:
            # normal training steps: forward, loss, backward, optimizer step
            ...
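Inside such a loop, checkpointing is usually restricted to rank 0, as discussed in the Rank section above; the save_checkpoint helper and the checkpoint.pt path below are assumptions for illustration:

import torch
import torch.distributed as dist

def save_checkpoint(model, path='checkpoint.pt'):
    # Only rank 0 writes to disk; model.module unwraps the
    # DistributedDataParallel wrapper so the plain weights are saved.
    if dist.get_rank() == 0:
        torch.save(model.module.state_dict(), path)
    dist.barrier()   # keep the other ranks from racing ahead of the save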
6. Elastic Training
It is worth noting the link mentioned in the comment of the sample code: /docs/stable/elastic/
It refers to PyTorch Elastic Training, which supports dynamically adding or removing nodes during training and offers strong fault tolerance.
- Tool: torchrun (the elastic launcher)
- Start command:
torchrun --standalone --nnodes=1 --nproc_per_node=4
This feature is very important for large-scale, long-term training tasks.
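Elastic restarts re-execute the training script, so the script itself must be able to resume from a checkpoint; maybe_resume below is a hypothetical sketch of that pattern (it assumes checkpoints store a 'model' state dict and an 'epoch' counter):

import os
import torch

def maybe_resume(model, path='checkpoint.pt'):
    # If a checkpoint exists, restore the weights and return the next epoch
    # to train; otherwise start from scratch.
    if os.path.exists(path):
        state = torch.load(path, map_location='cpu')
        model.load_state_dict(state['model'])
        return state['epoch'] + 1
    return 0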
7. Summary
Variable name | Meaning | Source | Typical uses |
---|---|---|---|
WORLD_SIZE | Total number of processes globally | Set by torchrun | Initialize the process group, global communication |
RANK | Global number of the current process | Set by torchrun | Control master-node behavior |
LOCAL_RANK | Local GPU number of the current process | Set by torchrun | GPU binding: torch.cuda.set_device |
Although these three lines of code are simple, they are the entry point to PyTorch distributed training. Once you understand them, you understand how PyTorch communicates and organizes training in distributed scenarios.
If you want to further understand PyTorch distributed training, the official documentation is recommended:
- PyTorch distributed training documentation
- Torchrun Instructions