Distributed and parallel training... explained

Data Parallel (DP) and Distributed Data Parallel (DDP) training in PyTorch and fastai v2

To train a Deep Learning model in parallel using PyTorch or fastai v2, there are 2 modes: Data Parallel (DP) and Distributed Data Parallel (DDP), but you should use DDP instead of DP (see below for the explanations).

1. PyTorch | How to train a model across multiple GPUs?

PyTorch | Data Parallel (DP)

nn.DataParallel (DP) runs one process on multiple devices of a single machine.

For example, it can run the training of your Deep Learning model (the training is one process) by distributing it across several GPUs of a single machine (each GPU is a device).

  • How? By distributing the batches of the training and validation dataloaders across the available GPUs. This is data parallelism at the module level.
  • Positive: the batch size can be bigger, as the batches are equally distributed to all GPUs.
  • Negative: the single process is a bottleneck that can increase the processing time.

This is the official definition from PyTorch:

  • This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension (other objects will be copied once per device). In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backwards pass, gradients from each replica are summed into the original module.
  • The batch size should be larger than the number of GPUs used.
  • WARNING: It is recommended to use DistributedDataParallel, instead of this class, to do multi-GPU training, even if there is only a single node. See: Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel and Distributed Data Parallel.
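To make the definition above more concrete, here is a minimal sketch of plain PyTorch nn.DataParallel usage (the toy model and tensor shapes are invented for the illustration):

import torch
from torch import nn

# A toy model; nn.DataParallel wraps any nn.Module
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

if torch.cuda.device_count() > 1:
    # One process drives all GPUs: each forward pass chunks the batch along
    # dim 0, replicates the module on every device and gathers the outputs
    # back on the default device.
    model = nn.DataParallel(model)

model = model.to("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(32, 128, device=next(model.parameters()).device)
y = model(x)  # the batch of 32 is split across the available GPUs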

PyTorch code

You will find the full code of the notebook 05_pet_breeds.ipynb using Data Parallel PyTorch code in my notebook 05_pet_breeds_DataParallel.ipynb (nbviewer version).

The key lines are the following:

# Wrap the Learner's model in nn.DataParallel when more than one GPU is available
if torch.cuda.device_count() > 1:
    learn.model = nn.DataParallel(learn.model)


Pytorch | Distributed Data Parallel (DDP)

nn.parallel.DistributedDataParallel (DDP) is designed for running multiple processes across the devices of multiple machines, but you can use it on the devices of a single machine as well. Unlike DataParallel, with DDP each device (GPU) independently runs its own copy of the process on a part of the training dataset (this is true process and data parallelism).

This is the official definition from PyTorch:

  • DDP implements distributed data parallelism that is based on torch.distributed package at the module level.
  • This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backwards pass, gradients from each node are averaged.
  • The batch size should be larger than the number of GPUs used locally.
  • See also: Basics and Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel. The same constraints on input as in torch.nn.DataParallel apply.
  • Creation of this class requires that torch.distributed to be already initialized, by calling torch.distributed.init_process_group().
  • DistributedDataParallel is proven to be significantly faster than torch.nn.DataParallel for single-node multi-GPU data parallel training.
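To make the initialization requirement concrete, here is a minimal sketch of what each DDP process would run in plain PyTorch; it assumes a launcher (torch.distributed.launch / torchrun, or fastai2.launch below) spawns one process per GPU and sets the rank and master address environment variables, and the toy model is invented for the illustration:

import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The launcher provides LOCAL_RANK (or a --local_rank argument) plus
    # RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT in the environment.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # torch.distributed must be initialized before creating the DDP wrapper
    dist.init_process_group(backend="nccl")

    model = nn.Linear(128, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    x = torch.randn(32, 128, device=local_rank)
    loss = ddp_model(x).sum()
    loss.backward()  # gradients are averaged across all processes here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()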


PyTorch tutorials

  • Distributed data parallel training in Pytorch
    • Motivation: The easiest way to speed up neural network training is to use a GPU, which provides large speedups over CPUs on the types of calculations (matrix multiplies and additions) that are common in neural networks. As the model or dataset gets bigger, one GPU quickly becomes insufficient. For example, big language models such as BERT and GPT-2 are trained on hundreds of GPUs. To do multi-GPU training, we must have a way to split the model and data between different GPUs and to coordinate the training.
    • Why distributed data parallel?: I like to implement my models in Pytorch because I find it has the best balance between control and ease of use of the major neural-net frameworks. Pytorch has two ways to split models and data across multiple GPUs: nn.DataParallel and nn.DistributedDataParallel . nn.DataParallel is easier to use (just wrap the model and run your training script). However, because it uses one process to compute the model weights and then distribute them to each GPU during each batch, networking quickly becomes a bottle-neck and GPU utilization is often very low. Furthermore, nn.DataParallel requires that all the GPUs be on the same node and doesn’t work with Apex for mixed-precision training.
  • PyTorch Distributed Training
    • PyTorch has a relatively simple interface for distributed training. To do distributed training, the model would just have to be wrapped using DistributedDataParallel and the training script would just have to be launched using torch.distributed.launch. Although PyTorch has offered a series of tutorials on distributed training, I found it insufficient or overwhelming to help the beginners to do state-of-the-art PyTorch distributed training. Some key details were missing and the usages of Docker container in distributed training were not mentioned at all.
    • In this blog post, I would like to present a simple implementation of PyTorch distributed training on CIFAR-10 classification using DistributedDataParallel wrapped ResNet models. The usage of Docker container for distributed training and how to start distributed training using torch.distributed.launch would also be covered.
  • Getting Started with Distributed Data Parallel
    • DistributedDataParallel (DDP) implements data parallelism at the module level which can run across multiple machines. Applications using DDP should spawn multiple processes and create a single DDP instance per process. DDP uses collective communications in the torch.distributed package to synchronize gradients and buffers. More specifically, DDP registers an autograd hook for each parameter given by model.parameters() and the hook will fire when the corresponding gradient is computed in the backward pass. Then DDP uses that signal to trigger gradient synchronization across processes. Please refer to DDP design note for more details.
    • The recommended way to use DDP is to spawn one process for each model replica, where a model replica can span multiple devices. DDP processes can be placed on the same machine or across machines, but GPU devices cannot be shared across processes. This tutorial starts from a basic DDP use case and then demonstrates more advanced use cases including checkpointing models and combining DDP with model parallel.
  • Distributed Data Parallel: torch.nn.parallel.DistributedDataParallel (DDP) transparently performs distributed data parallel training. This page describes how it works and reveals implementation details.

2. fastai v2 | How to train a model across multiple GPUs?

fastai v2 implements the 2 modes, as documented in the fastai v2 doc Distributed and parallel training.

fastai v2 | Data Parallel (DP)

This is the simplest mode. Use the methods of the ParallelTrainer class.

You will find the full code of the notebook 05_pet_breeds.ipynb using Data Parallel fastai v2 code in my notebook 05_pet_breeds_DataParallel.ipynb (nbviewer version).

The key lines are the following:

# `partial` comes with the fastai star imports (or from functools);
# `gpu` and `n_gpu` are defined in earlier cells of the notebook.
ctx = learn.parallel_ctx
with partial(ctx, gpu)():
    print(f"Training in {ctx.__name__} context on GPU {list(range(n_gpu))}")
    learn.fine_tune(2)

fastai v2 | Distributed Data Parallel (DDP)

This is the fastest mode, as it truly trains all the model copies in parallel (one model copy per GPU). Use the methods of the DistributedTrainer class.

As you will need to launch at least 2 processes in parallel, this cannot be done in a Jupyter notebook; instead, run the following command in a Terminal, from the path of script.py, within a fastai v2 virtual environment:

python -m fastai2.launch script.py

where fastai2.launch refers to launch.py, and script.py contains the DDP-formatted training code, like the example files in fastai > fastai2 > nbs > examples on GitHub (these are run in the next section).
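For orientation, here is a rough, hypothetical sketch of the shape such a script takes, loosely modeled on train_imagenette.py: the helpers (call_parse, setup_distrib, num_distrib, to_distributed) are the fastai2 names as I understand them at the time of writing, and the Pets dataset code is only an illustration, so refer to the actual example scripts for the authoritative version.

# sketch of a DDP-formatted script (illustrative only)
from fastai2.vision.all import *
from fastai2.distributed import *       # DistributedTrainer, to_distributed, ...
from fastscript import call_parse, Param

@call_parse
def main(gpu: Param("GPU to run on", int) = None):
    # fastai2.launch spawns one copy of this script per GPU and passes its index
    gpu = setup_distrib(gpu)

    path = untar_data(URLs.PETS)
    dls = ImageDataLoaders.from_name_re(
        path, get_image_files(path/"images"), pat=r"(.+)_\d+.jpg$",
        item_tfms=Resize(460), batch_tfms=aug_transforms(size=224), bs=64)

    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    if num_distrib() > 1:
        learn.to_distributed(gpu)       # wraps the model in DistributedDataParallel
    learn.fine_tune(2)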

Notebooks on DDP from fastai v2

I ran the following commands from the path of the .py files, in a Terminal of a server with 2 NVIDIA V100 32GB GPUs, within a fastai v2 virtual environment.

1. Image Classification with ImageNette

File: train_imagenette.py

python -m fastai2.launch train_imagenette.py

In order to test the train_imagenette.py code with another dataset, I created the file 05_pet_breeds_DDP.py from the notebook 05_pet_breeds.ipynb. It ran perfectly with DDP (Distributed Data Parallel) and fastai v2, dividing the training and validation time by 3 :slight_smile:, using the following command:

python -m fastai2.launch 05_pet_breeds_DDP.py

To be compared with the training and validation times in the notebook 05_pet_breeds.ipynb (just one GPU)…

[Screenshot: fine_tune output of 05_pet_breeds_DDP.py on 2 GPUs]

2. Text Classification with IMDB

File: train_imdbclassifier-Copy1.py (I needed to set DistributedTrainer.fup = True in the original file train_imdbclassifier.py; see the snippet after the command below)

python -m fastai2.launch train_imdbclassifier-Copy1.py
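For reference, the only change compared to the original train_imdbclassifier.py is setting this class attribute before training starts (a sketch; the import assumes the fastai2 distributed module discussed above):

from fastai2.distributed import DistributedTrainer

# Workaround needed to get the IMDB classifier script running in DDP (see above)
DistributedTrainer.fup = True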

3. Tabular Classification with ADULTS

File: train_tabular.py

python -m fastai2.launch train_tabular.py

Note: when running python -m fastai2.launch script.py, if you get the error RuntimeError: Address already in use (raised at store = TCPStore(master_addr, master_port, start_daemon)) in the Terminal, just run the command ps -elf | grep python to get the PIDs of the zombie Python processes. Then you can kill them by PID (e.g. kill -9 14275) or by file name (e.g. pkill -9 -f script.py).

fastai v2 | Problems not solved with DDP

I wanted to run the Transformers tutorial in DDP using the code from train_imdbclassifier.py, but it does not work (see this post).
