As deep learning models grow ever larger, the memory of a single GPU can no longer support efficient training. Distributed training has therefore become an essential technique for accelerating training.
This article analyzes the following three lines of typical PyTorch distributed training code in detail and, along the way, introduces the core concepts and practical methods of distributed training:
local_rank = int(os.environ.get('LOCAL_RANK', -1))   # /docs/stable/elastic/
global_rank = int(os.environ.get('RANK', -1))
world_size = int(os.environ.get('WORLD_SIZE', 1))
1. What is distributed training?
Distributed training refers to splitting the model training process across multiple computing devices (usually multiple GPUs, possibly spread across multiple machines) for collaborative processing. The goals are to accelerate training and to extend model capacity.
Distributed training can be divided into the following modes:
- Data Parallelism: Each GPU processes a different subset of data and synchronizes the gradient.
- Model Parallelism: Split the model into multiple parts and deploy them to different GPUs (a minimal sketch of this idea follows the list).
- Hybrid Parallelism: Combining model parallelism and data parallelism.
- Pipeline Parallelism: Slice the model by layer, and different GPUs process different stages.
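As a rough illustration of the model-parallel idea (data parallelism is covered by the DDP example in sections 4 and 5 below), here is a minimal sketch, assuming a machine with at least two GPUs; the TwoStageModel name and layer sizes are invented for this example:

import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Toy model-parallel example: stage1 lives on cuda:0, stage2 on cuda:1."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(1024, 4096).to('cuda:0')
        self.stage2 = nn.Linear(4096, 10).to('cuda:1')

    def forward(self, x):
        x = torch.relu(self.stage1(x.to('cuda:0')))
        return self.stage2(x.to('cuda:1'))   # move activations to the second GPU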
2. Understand the core concepts of distributed training
1. World Size (global number of processes)
world_size = int(os.environ.get('WORLD_SIZE', 1))
- Meaning: The total number of processes participating in the distributed training job. Usually equal to the total number of GPUs.
- Effect: Used to initialize the process group (dist.init_process_group()) so that each process knows the size of the cluster.
For example, if you have two machines, each with 4 GPUs, then world_size = 8.
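For reference, once the process group has been initialized (see section 4 below), the same value can also be queried at runtime instead of re-reading the environment variable; a minimal sketch:

import torch.distributed as dist

# Only valid after dist.init_process_group() has been called.
if dist.is_initialized():
    print(f"world_size = {dist.get_world_size()}")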
2. Rank (global process number)
global_rank = int(os.environ.get('RANK', -1))
- Meaning: The unique number of the current process among all training processes (starting from 0).
- Effect: Commonly used to mark the master node (rank == 0) and to control log output, model saving, etc. (a small helper for this check is sketched after the example below).
For example:
- rank=0: Responsible for printing logs and saving models
- rank=1,2,…: Only do training
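A common way to express this split in code is a small rank check; the is_main_process helper below is a sketch for illustration, not a PyTorch API:

import torch.distributed as dist

def is_main_process():
    # Treat rank 0 as the master; fall back to True when the script
    # is run without distributed initialization.
    return (not dist.is_initialized()) or dist.get_rank() == 0

if is_main_process():
    print("rank 0: writing logs and saving checkpoints")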
3. Local Rank (local process number)
local_rank = int(os.environ.get('LOCAL_RANK', -1))
- Meaning: The GPU index of the current training process on the local machine. Generally used in conjunction with CUDA_VISIBLE_DEVICES.
- Effect: Used to specify which GPU the process should use, for example:
torch.cuda.set_device(local_rank)
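Putting the two together, a typical device-binding snippet looks like the following sketch, which also handles the non-distributed fallback implied by the -1 default:

import os
import torch

# Bind this process to its local GPU; fall back to CPU when the script
# is launched without a distributed launcher (LOCAL_RANK unset).
local_rank = int(os.environ.get('LOCAL_RANK', -1))
if local_rank >= 0:
    torch.cuda.set_device(local_rank)
    device = torch.device('cuda', local_rank)
else:
    device = torch.device('cpu')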
3. How to set environment variables
These environment variables are usually set by the distributed launcher. For example, using torchrun:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
    --master_addr=192.168.1.1 --master_port=12345
torchrun will automatically set the following variables for each process:
- LOCAL_RANK
- RANK
- WORLD_SIZE
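A quick way to confirm this from inside the training script is to print the injected variables per process; a small sketch:

import os

# Each rank prints its own values as set by the launcher.
for key in ('LOCAL_RANK', 'RANK', 'WORLD_SIZE'):
    print(f"{key} = {os.environ.get(key, '<not set>')}")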
You can also export manually:
export WORLD_SIZE=8
export RANK=3
export LOCAL_RANK=3
4. Distributed training initialization process (PyTorch example)
In PyTorch, a typical initialization process is as follows:
import os
import torch
import torch.distributed as dist

def setup_distributed():
    local_rank = int(os.environ.get('LOCAL_RANK', -1))
    global_rank = int(os.environ.get('RANK', -1))
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend='nccl',        # use nccl for GPU training, gloo for CPU
        init_method='env://',
        world_size=world_size,
        rank=global_rank
    )
    return local_rank
- init_method='env://': indicates that initialization information is read from environment variables.
- nccl: NVIDIA's high-performance communication library, which supports high-speed communication between GPUs.
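Not shown above, but usually paired with this initialization: tearing down the process group when training finishes. A minimal sketch:

import torch.distributed as dist

# Counterpart to init_process_group(): synchronize all ranks and release
# communication resources at the end of training.
dist.barrier()
dist.destroy_process_group()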
5. The code structure of distributed training
The basic framework for implementing distributed training using PyTorch:
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel

def train():
    local_rank = setup_distributed()
    model = MyModel().cuda()
    model = DistributedDataParallel(model, device_ids=[local_rank])
    dataset = MyDataset()
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=64)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)   # reshuffle the data differently each epoch
        for batch in dataloader:
            # normal training steps: forward, loss, backward, optimizer step
            ...
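Inside such a loop, checkpointing is usually restricted to rank 0, as discussed in the Rank section above; the save_checkpoint helper and the checkpoint.pt path below are assumptions for illustration:

import torch
import torch.distributed as dist

def save_checkpoint(model, path='checkpoint.pt'):
    # Only rank 0 writes to disk; model.module unwraps the
    # DistributedDataParallel wrapper so the plain weights are saved.
    if dist.get_rank() == 0:
        torch.save(model.module.state_dict(), path)
    dist.barrier()   # keep the other ranks from racing ahead of the save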
6. Elastic Training
It is worth noting the link mentioned in the comment of the sample code: /docs/stable/elastic/
It refers to PyTorch Elastic Training, which supports dynamically adding or removing nodes during training and offers strong fault tolerance.
- Tool: torchrun (the elastic launcher)
- Start command:
torchrun --standalone --nnodes=1 --nproc_per_node=4
This feature is very important for large-scale, long-term training tasks.
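Elastic restarts re-execute the training script, so the script itself must be able to resume from a checkpoint; maybe_resume below is a hypothetical sketch of that pattern (it assumes checkpoints store a 'model' state dict and an 'epoch' counter):

import os
import torch

def maybe_resume(model, path='checkpoint.pt'):
    # If a checkpoint exists, restore the weights and return the next epoch
    # to train; otherwise start from scratch.
    if os.path.exists(path):
        state = torch.load(path, map_location='cpu')
        model.load_state_dict(state['model'])
        return state['epoch'] + 1
    return 0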
7. Summary
Variable name | Meaning | Source | Typical uses |
---|---|---|---|
WORLD_SIZE | Total number of processes globally | Set by torchrun | Initialize the process group, global communication |
RANK | Global number of the current process | Set by torchrun | Control master-node behavior |
LOCAL_RANK | Local GPU number of the current process | Set by torchrun | GPU binding: torch.cuda.set_device |
Although these three lines of code are simple, they are the entry point to PyTorch distributed training. Once you understand them, you understand how PyTorch communicates and organizes training in distributed scenarios.
If you want to further understand PyTorch distributed training, the official documentation is recommended:
- PyTorch distributed training documentation
- Torchrun Instructions