Distributed and parallel training... explained

pierreguillou · June 25, 2020, 9:16pm

[ EDIT - 06/29/2020 ] Do not follow this post about Data Parallel (DP) and Distributed Data Parallel (DDP) training in PyTorch and fastai v2: it contains wrong code. Instead, you should look at the Distributed and parallel training fastai v2 documentation and this guide I published. Inside, you will find an explanation about how to train a notebook using fastai v2 DDP (Distributed Data Parallel) via a terminal … of course

Hello,

I’m a lucky guy: I have access to two NVIDIA V100 GPUs.

I would like to use both when training a model with fastai v2… but when I try to apply the code from Distributed and parallel training, it fails.

For example, in the notebook 05_pet_breeds.ipynb, if I use the following code, it does not work:

… but if I use nn.DataParallel(), it works very well!

Does anyone have a how-to guide on training a model with fastai v2 in parallel?

Note: by the way, what is the difference between Distributed and parallel training? Thanks.

ilovescience · June 26, 2020, 1:52am

I was just curious whether or not DataParallel models work with fastai2 (I am also lucky enough to have access to multiple GPUs). You just answered my question!

Regarding the difference, here’s a good PyTorch forum thread:

pierreguillou · June 29, 2020, 11:59am

Thank you @ilovescience for the link.

I’ve published an example of code that runs perfectly with DDP (Distributed Data Parallel) in a terminal on my machine with two NVIDIA V100 32 GB GPUs, thanks to the fastai v2 distributed training code.

See it in my guide about Data Parallel (DP) and Distributed Data Parallel (DDP) training in PyTorch and fastai v2.

However, I tried to do the same thing with the Transformers tutorial (notebook 39_tutorial.transformers.ipynb) from @sgugger, but it does not work (so far). Any suggestions to make it work? (see my post)

ilovescience · June 29, 2020, 5:52pm

I am curious, how does fastai2 distributed training take into account the training and validation losses and metrics? With the dataset divided among the GPUs, are the losses and metrics just averaged over all the GPUs? Or are they calculated on the whole dataset on a single GPU? Are you aware of anything like this?

pierreguillou · June 29, 2020, 9:13pm

As fastai v2 DDP relies fully on PyTorch, the answer to your question is in the PyTorch documentation.
For example, here.

This container (torch.nn.parallel.DistributedDataParallel()) parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backwards pass, gradients from each node are averaged.

pierreguillou · June 29, 2020, 9:16pm

Data Parallel (DP) and Distributed Data Parallel (DDP) training in PyTorch and fastai v2

To train a Deep Learning model in parallel using PyTorch or fastai v2, there are 2 modes: DataParallel (DP) and Distributed Data Parallel (DDP). You should use DDP instead of DP (see below for explanations).

1. PyTorch | How to train a model across multiple GPUs?

PyTorch | Data Parallel (DP)

nn.DataParallel (DP) runs a single process on multiple devices of a single machine.

As an example, it can perform the training of your Deep Learning model (which is a process) by distributing it over several GPUs of a single machine (a GPU is a device).

How? By distributing batches of the training and validation dataloaders across the available GPUs. This is data parallelism at the module level.
Positive: the batch size can be bigger, as each batch is split equally across all GPUs
Negative: the single process is a bottleneck that can increase processing time

This is the official definition from PyTorch:

This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension (other objects will be copied once per device). In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backwards pass, gradients from each replica are summed into the original module.

The batch size should be larger than the number of GPUs used.

WARNING: It is recommended to use DistributedDataParallel , instead of this class, to do multi-GPU training, even if there is only a single node. See: Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel and Distributed Data Parallel.
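
To make "chunking in the batch dimension" concrete, here is a minimal pure-Python sketch. The helper split_batch is hypothetical (it is not a PyTorch function) and assumes a ceiling-sized contiguous split, similar in spirit to torch.chunk. It shows why a batch smaller than the number of GPUs leaves a device without work.

```python
import math

def split_batch(batch, n_devices):
    """Split `batch` (a list of samples) into at most `n_devices` contiguous chunks.

    Hypothetical helper for illustration: the chunk size is ceil(len(batch) / n_devices).
    """
    if not batch:
        return []
    size = math.ceil(len(batch) / n_devices)
    return [batch[i:i + size] for i in range(0, len(batch), size)]

# A batch of 8 samples on 2 devices: each device receives 4 samples.
chunks_2 = split_batch(list(range(8)), 2)

# A batch of 1 sample on 2 devices: only one chunk exists, so one device is idle.
chunks_small = split_batch([0], 2)

print([len(c) for c in chunks_2], [len(c) for c in chunks_small])
```

With an uneven batch (for example 5 samples on 2 devices), this sketch gives chunks of 3 and 2 samples, so the devices do not all receive the same amount of work.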

PyTorch code

You will find the full code of the notebook 05_pet_breeds.ipynb using Data Parallel PyTorch code in my notebook 05_pet_breeds_DataParallel.ipynb (nbviewer version).

The key lines are the following:

import torch
from torch import nn
if torch.cuda.device_count() > 1:
    learn.model = nn.DataParallel(learn.model)  # wrap the model so each batch is split across the GPUs

PyTorch forum

About Data Parallel and DataParallel

PyTorch | Distributed Data Parallel (DDP)

nn.parallel.DistributedDataParallel (DDP) is useful when you want to run multiple processes on devices spread over multiple machines, but you can use it on the devices of a single machine as well. Unlike DataParallel, with DDP each device (GPU) independently runs its own copy of the process on a part of the training dataset (this is true process and data parallelism).
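
As a rough illustration of "each process works on a part of the training dataset", here is a pure-Python sketch. The function shard_indices is hypothetical; it only mimics the general idea of a distributed sampler (pad the index list so that every process gets the same number of samples, then deal the indices out round-robin) and leaves out shuffling.

```python
import math

def shard_indices(n_samples, world_size, rank):
    """Return the sample indices handled by process `rank` out of `world_size`.

    Hypothetical helper: the index list is padded by wrapping around so that
    every process receives the same number of samples.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    per_rank = math.ceil(n_samples / world_size)
    total = per_rank * world_size
    # Repeat the indices as often as needed, then cut to the padded length.
    indices = (list(range(n_samples)) * math.ceil(total / n_samples))[:total]
    return indices[rank:total:world_size]

# 7 samples shared by 2 processes: index 0 is reused once as padding.
shards = [shard_indices(7, 2, rank) for rank in range(2)]
print(shards)
```

Each process then builds its batches from its own shard only, which is why every process needs its own dataloader.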

This is the official definition from PyTorch:

DDP implements distributed data parallelism that is based on torch.distributed package at the module level.

This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backwards pass, gradients from each node are averaged.

The batch size should be larger than the number of GPUs used locally.

See also: Basics and Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel. The same constraints on input as in torch.nn.DataParallel apply.

Creation of this class requires that torch.distributed to be already initialized, by calling torch.distributed.init_process_group() .

DistributedDataParallel is proven to be significantly faster than torch.nn.DataParallel for single-node multi-GPU data parallel training.
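
To see what "gradients from each node are averaged" means in practice, here is a toy pure-Python check (not PyTorch code; the one-parameter model, the data and the function name grad_mse are made up for illustration). With shards of equal size, averaging the per-replica gradients gives the same value as the gradient computed on the full batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Full-batch gradient, as a single process would compute it.
full = grad_mse(w, xs, ys)

# Two replicas, each seeing half of the batch, followed by an average.
g0 = grad_mse(w, xs[:2], ys[:2])
g1 = grad_mse(w, xs[2:], ys[2:])
averaged = (g0 + g1) / 2

print(full, averaged)
```

If the shards had different sizes, a plain average of the two gradients would no longer match the full-batch gradient exactly.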

PyTorch forum

About Distributed Data Parallel and DistributedDataParallel

PyTorch code

ImageNet training in PyTorch: this implements training of popular model architectures (file main.py), such as ResNet, AlexNet, and VGG on the ImageNet dataset with multi-processing Distributed Data Parallel Training:
- Single node, multiple GPUs
- Multiple nodes

PyTorch tutorials

Distributed data parallel training in Pytorch
- Motivation: The easiest way to speed up neural network training is to use a GPU, which provides large speedups over CPUs on the types of calculations (matrix multiplies and additions) that are common in neural networks. As the model or dataset gets bigger, one GPU quickly becomes insufficient. For example, big language models such as BERT and GPT-2 are trained on hundreds of GPUs. To do multi-GPU training, we must have a way to split the model and data between different GPUs and to coordinate the training.
- Why distributed data parallel?: I like to implement my models in Pytorch because I find it has the best balance between control and ease of use of the major neural-net frameworks. Pytorch has two ways to split models and data across multiple GPUs: nn.DataParallel and nn.DistributedDataParallel . nn.DataParallel is easier to use (just wrap the model and run your training script). However, because it uses one process to compute the model weights and then distribute them to each GPU during each batch, networking quickly becomes a bottle-neck and GPU utilization is often very low. Furthermore, nn.DataParallel requires that all the GPUs be on the same node and doesn’t work with Apex for mixed-precision training.
PyTorch Distributed Training
- PyTorch has relatively simple interface for distributed training. To do distributed training, the model would just have to be wrapped using DistributedDataParallel and the training script would just have to be launched using torch.distributed.launch . Although PyTorch has offered a series of tutorials on distributed training, I found it insufficient or overwhelming to help the beginners to do state-of-the-art PyTorch distributed training. Some key details were missing and the usages of Docker container in distributed training were not mentioned at all.
- In this blog post, I would like to present a simple implementation of PyTorch distributed training on CIFAR-10 classification using DistributedDataParallel wrapped ResNet models. The usage of Docker container for distributed training and how to start distributed training using torch.distributed.launch would also be covered.
Getting Started with Distributed Data Parallel
- DistributedDataParallel (DDP) implements data parallelism at the module level which can run across multiple machines. Applications using DDP should spawn multiple processes and create a single DDP instance per process. DDP uses collective communications in the torch.distributed package to synchronize gradients and buffers. More specifically, DDP registers an autograd hook for each parameter given by model.parameters() and the hook will fire when the corresponding gradient is computed in the backward pass. Then DDP uses that signal to trigger gradient synchronization across processes. Please refer to DDP design note for more details.
- The recommended way to use DDP is to spawn one process for each model replica, where a model replica can span multiple devices. DDP processes can be placed on the same machine or across machines, but GPU devices cannot be shared across processes. This tutorial starts from a basic DDP use case and then demonstrates more advanced use cases including checkpointing models and combining DDP with model parallel.
Distributed Data Parallel: torch.nn.parallel.DistributedDataParallel (DDP) transparently performs distributed data parallel training. This page describes how it works and reveals implementation details.

2. fastai v2 | How to train a model across multiple GPUs?

fastai v2 implements the 2 modes as documented in Distributed and parallel training fastai v2 doc.

fastai v2 | Data Parallel (DP)

This is the simplest one. Use the methods of the ParallelTrainer class.

You will find the full code of the notebook 05_pet_breeds.ipynb using Data Parallel fastai v2 code in my notebook 05_pet_breeds_DataParallel.ipynb (nbviewer version).

The key lines are the following:

ctx = learn.parallel_ctx
with partial(ctx, gpu)():
    print(f"Training in {ctx.__name__} context on GPU {list(range(n_gpu))}")
    learn.fine_tune(2)

fastai v2 | Distributed Data Parallel (DDP)

This is the fastest one, as it truly trains all the model copies in parallel (one model copy per GPU). Use the methods of the DistributedTrainer class.

As you will need to launch at least 2 processes in parallel, this cannot be done in a Jupyter notebook but only in a terminal, via the following command run from the script.py path within a fastai v2 virtual environment:

python -m fastai2.launch script.py

where fastai2.launch refers to launch.py, and script.py contains the DDP-formatted training code, like the following files in fastai > fastai2 > nbs > examples on GitHub:

train_imagenette.py (image classification)
train_imdbclassifier.py (text classification)
train_tabular.py (tabular classification)

Notebooks on DDP from fastai v2

I ran the following commands from the path of the py files in a terminal on a server with 2 NVIDIA V100 32 GB GPUs, within a fastai v2 virtual environment.

1. Image Classification with Imagenette

File: train_imagenette.py

python -m fastai2.launch train_imagenette.py

In order to test the train_imagenette.py code with another dataset, I created the file 05_pet_breeds_DDP.py from the notebook 05_pet_breeds.ipynb. It ran perfectly with DDP (Distributed Data Parallel) and fastai v2, and the training and validation time was divided by 3, using the following command:

python -m fastai2.launch 05_pet_breeds_DDP.py

Compare this to the training and validation times in the notebook 05_pet_breeds.ipynb (just one GPU)…

[Image: petsfinetune]

2. Text Classification with IMDB

File: train_imdbclassifier-Copy1.py (I needed to set DistributedTrainer.fup = True in the original file train_imdbclassifier.py)

python -m fastai2.launch train_imdbclassifier-Copy1.py

3. Tabular Classification with ADULTS

File: train_tabular.py

python -m fastai2.launch train_tabular.py

Note: when running python -m fastai2.launch script.py, if you get the error store = TCPStore(master_addr, master_port, start_daemon) RuntimeError: Address already in use in the terminal, run ps -elf | grep python to get the PIDs of the zombie Python processes still running. Then you can kill them by PID (e.g. kill -9 14275) or by file name (e.g. pkill -9 -f script.py).
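
Before killing anything, you can also check from Python whether a port is still occupied. This is only a sketch: the helper port_in_use is hypothetical, and the port number 29500 is an assumption (it is the default master port of the PyTorch launcher; use whichever port your launcher actually reports).

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, an error code otherwise.
        return s.connect_ex((host, port)) == 0

# 29500 is an assumed port number, see the note above.
print(port_in_use(29500))
```

If this prints True while no training is supposed to be running, a leftover process is probably still holding the port.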

fastai v2 | Problems not solved with DDP

I wanted to run in DDP the Transformers tutorial using the code from train_imdbclassifier.py but it does not work (see this post).

ilovescience · June 29, 2020, 9:32pm

Sorry, maybe my question wasn’t clear. The gradients for updating the model are averaged. But what about the actual loss and metrics?

For example, let’s say you finish an epoch. Then you have a validation set. It’s divided among the GPUs. So are the loss and metric calculated on each GPU and then averaged? Or is the model applied on each GPU, the predictions gathered from all the GPUs, and then the loss and metrics calculated on the whole dataset at once?

In PyTorch you can implement it either way. Huggingface implements it by averaging but apparently they claim that you cannot trust those metrics (see here). Do you know what way fastai2 implements it?
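
To illustrate why the two approaches can give different numbers, here is a toy sketch with made-up predictions (plain Python, not fastai2 or Hugging Face code). When the per-GPU shards have different sizes, the average of the per-GPU accuracies is not the accuracy over the whole validation set.

```python
def accuracy(preds, targets):
    """Fraction of predictions equal to their target."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

# Hypothetical validation shards held by two processes (unequal sizes on purpose).
preds_0, targs_0 = [1, 0, 1, 1], [1, 0, 0, 1]   # 3 of 4 correct
preds_1, targs_1 = [0, 1], [1, 0]               # 0 of 2 correct

# Option 1: compute the metric on each process, then average the two numbers.
mean_of_means = (accuracy(preds_0, targs_0) + accuracy(preds_1, targs_1)) / 2

# Option 2: gather predictions and targets, then compute the metric once.
global_acc = accuracy(preds_0 + preds_1, targs_0 + targs_1)

print(mean_of_means, global_acc)
```

Here option 1 gives 0.375 and option 2 gives 0.5. With equal-sized shards the two would agree for accuracy, but not necessarily for metrics that are not simple per-sample averages.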

ilovescience · June 30, 2020, 2:18am

A paper on PyTorch Distributed published by the team:

Maybe some unanswered questions are answered here…

pierreguillou · July 1, 2020, 8:03pm

fastai v2 and Transformers | Problems not solved with DDP

I wanted to run in DDP the Transformers tutorial of Sylvain using the code of train_imdbclassifier.py.

To do this, I created the script 39_tutorial.transformers_DDP.py, which I ran with the following command in the same environment (server with 2 NVIDIA V100 32 GB GPUs within a fastai v2 virtual environment) as the one used for my (successful) tests with the fastai v2 scripts (see this post):

python -m fastai2.launch 39_tutorial.transformers_DDP.py

However, it did not work.
@ilovescience, @morgan, @wgpubs, @muellerzr, @sgugger: if you have an idea about it, you are welcome to post it. Thank you in advance.

Versions of frameworks: transformers==3.0.0 | fastai2==0.0.17

(fastai2) pierre@tesla:~/fastai2/nbs$ python -m fastai2.launch 39_tutorial.transformers_DDP.py

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Rank[0] Run: 0; epochs: 1; lr: 0.0001; bs: 8; sl: 1024
Rank[1] Run: 0; epochs: 1; lr: 0.0001; bs: 8; sl: 1024
Training in distributed data parallel context on GPU 1
Training in distributed data parallel context on GPU 0
epoch     train_loss  valid_loss  perplexity  time
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
[... the two warning lines above are repeated 15 more times ...]
Traceback (most recent call last):
  File "39_tutorial.transformers_DDP.py", line 66, in <module>
    runs:  Param("Number of times to repeat training", int)=1,
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 76, in call_parse
    return _f()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 73, in _f
    func(**args.__dict__)
  File "39_tutorial.transformers_DDP.py", line 126, in main
    learn.fit_one_cycle(epochs, lr)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/callback/schedule.py", line 113, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 200, in fit
    self._do_epoch_train()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 175, in _do_epoch_train
    self.all_batches()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 153, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 98, in __iter__
    for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data = next(self.dataset_iter)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 107, in create_batches
    yield from map(self.do_batch, self.chunkify(res))
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 128, in do_batch
    def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 127, in create_batch
    def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 45, in fa_collate
    return (default_collate(t) if isinstance(b, _collate_types)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [3029] at entry 0 and [4514] at entry 1

0         nan         00:00
^CTraceback (most recent call last):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/distributed.py", line 166, in distrib_ctx
    yield self
  File "39_tutorial.transformers_DDP.py", line 126, in main
    learn.fit_one_cycle(epochs, lr)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/callback/schedule.py", line 113, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 200, in fit
    self._do_epoch_train()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 175, in _do_epoch_train
    self.all_batches()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 153, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 98, in __iter__
    for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data = next(self.dataset_iter)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 107, in create_batches
    yield from map(self.do_batch, self.chunkify(res))
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 128, in do_batch
    def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 127, in create_batch
    def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 45, in fa_collate
    return (default_collate(t) if isinstance(b, _collate_types)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [4382] at entry 0 and [4065] at entry 1


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "39_tutorial.transformers_DDP.py", line 66, in <module>
    runs:  Param("Number of times to repeat training", int)=1,
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 76, in call_parse
    return _f()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 73, in _f
    func(**args.__dict__)
  File "39_tutorial.transformers_DDP.py", line 126, in main
    learn.fit_one_cycle(epochs, lr)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/distributed.py", line 169, in distrib_ctx
    if cleanup_dpg: teardown_distrib()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/distributed.py", line 65, in teardown_distrib
    if torch.distributed.is_initialized(): torch.distributed.destroy_process_group()
KeyboardInterrupt
^CTraceback (most recent call last):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1653, in _wait
    (pid, sts) = self._try_wait(0)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1611, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/launch.py", line 9, in <module>
    args:Param("Args to pass to script", nargs='...', opt=False)=''
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 76, in call_parse
    return _f()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 73, in _f
    func(**args.__dict__)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/launch.py", line 26, in main
    for process in processes: process.wait()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1032, in wait
    self._wait(timeout=sigint_timeout)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1647, in _wait
    time.sleep(delay)
KeyboardInterrupt
(fastai2) pierre@tesla:~/fastai2/nbs$
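
Side note on the repeated warning in the log above: as the message itself says, it can be silenced by setting TOKENIZERS_PARALLELISM explicitly. A minimal sketch, assuming the variable is set at the very top of the script, before the tokenizers are used:

```python
import os

# Set the variable named in the warning before the tokenizers are used.
# "false" disables the tokenizers' internal parallelism. This only concerns
# the warning; it does not address the RuntimeError in the traceback.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```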

wgpubs · July 1, 2020, 9:12pm

I wish I could help, but huggingface v.3 has currently broken all my transformer code.

I can tell you’re running v.3 from the warning messages above … are you sure the problem isn’t with v.3 rather than with fastai v2? Just curious if this runs fine on a single GPU with the latest version of huggingface … and if not, I’d start there.

-wg

morgan · July 1, 2020, 9:44pm

Sorry I haven’t done any distributed work before

Also afraid to peek at v3

Sorry!

pierreguillou · July 1, 2020, 9:57pm

Thanks for your message @wgpubs, but the problem is independent of the Transformers version. In fact, it does not come from Transformers v3: apart from the warning, I had the same problem with 2.11.0 (I updated today from 2.11.0 to 3.0.0).

And the Transformers tutorial of Sylvain works perfectly well with Transformers v3 on one GPU (at least on my server).

I think the problem is mainly here:

File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [4382] at entry 0 and [4065] at entry 1

My understanding is that the training and validation datasets are distributed to the 2 processes (one per GPU), not the batches (the sequence length of one batch in the DataLoaders is 1024). The batches are then created on each GPU, but without respecting the sequence length of 1024. As the datasets are a concatenation of texts of different lengths, torch.stack() cannot process them.

The question is: why are the DataLoaders not used at the process level when the mode is DDP in fastai v2?
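
As a sketch of what I mean (toy pure-Python code with hypothetical function names and made-up token lists, not the fastai v2 implementation): items of different lengths cannot be stacked into one rectangular batch, whereas a concatenated stream cut into chunks of a fixed sequence length can.

```python
def can_stack(items):
    """A batch can be stacked only if every item has the same length."""
    return len({len(item) for item in items}) <= 1

def chunk_stream(texts, seq_len):
    """Concatenate tokenized texts and cut the stream into full chunks of `seq_len` tokens."""
    stream = [tok for text in texts for tok in text]
    n_chunks = len(stream) // seq_len   # the incomplete tail is dropped
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_chunks)]

# Made-up "tokenized texts" of different lengths (7, 5 and 4 tokens).
texts = [list(range(7)), list(range(5)), list(range(4))]

print(can_stack(texts))                   # raw texts: lengths differ
print(can_stack(chunk_stream(texts, 4)))  # fixed-length chunks: same length
```

The sizes in the error message ([4382] and [4065]) look like the first case: whole texts reaching the collate function instead of chunks of 1024 tokens.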

pierreguillou · August 27, 2020, 12:21pm

To read about training with multiple GPUs!

Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups (Thomas Wolf - Hugging Face, Oct 15, 2018)

ai_padawan · August 28, 2020, 10:23pm

@pierreguillou, I’ve noticed in the distributed/parallel fastai docs (https://docs.fast.ai/distributed.html), there is a section for distributed dataloader.

In the parallel notebook: https://github.com/piegu/fastai-projects/blob/master/05_pet_breeds_DataParallel.ipynb

There is no such distributed dataloader. Do we need to write a distributed/parallel dataloader as well?

ai_padawan · August 29, 2020, 3:30am

Parallel works out of the box, but I’m running into issues with distributed.

First, it appears that “learn.summary()” is not compatible with distributed training. You get an “AssertionError: Default process group is not initialized” error, which goes away when I comment out that line.

But then it gets stuck on the first epoch and never trains:

Training in distrib_ctx context on GPU 1
Training in distrib_ctx context on GPU 0
epoch     train_loss  valid_loss  time
Epoch 1/2 : |----------------------------------------------------------------------------------------------------| 0.00% [0/90 00:00<00:00]

I’m also using a custom loss function and custom dataloader… does that need to be modified too?

ai_padawan · August 30, 2020, 1:29am

Ran into more issues. This time with parallel training.

I have the exact same code with the exact same container: it works fine on machine A, but crashes on machine B.

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

After much hacking, I found that it was learn.to_fp16() that was causing the issue! It looks like fp16 training does not sit well with parallel training. Some googling led to hints that the weights were not distributed across all the GPUs correctly, and that it’s related to the naming of the GPU device IDs? Does anyone know how to troubleshoot this?

philchu · September 4, 2020, 4:20am

Would you like to open a new issue on the fastai2 repo, with instructions on how to reproduce this error? I can take a look later (I wrote the distrib_ctx thingie in fastai v2 and the assertion looks familiar).

Thanks.

Phil

neuralconcept · October 13, 2020, 9:51pm

@pierreguillou any update on the error? I ran into the same problem when I tried to distribute the transformer.

pierreguillou · October 15, 2020, 11:55am

Hello @neuralconcept. Sorry, but I did not try again and I did not receive a solution to my post.

neuralconcept · October 18, 2020, 6:29pm

Thanks. I think the problem is in the dataloader; however, I do not know how to implement a fix.