I am curious, how does fastai2 distributed training take into account the training and validation losses and metrics? With the dataset divided among the GPUs, are the losses and metrics just averaged over all the GPUs? Or are they calculated on the whole dataset on a single GPU? Are you aware of anything like this?
As fastai v2 DDP uses plain PyTorch under the hood, the answer to your question is in the PyTorch documentation.
For example, here.
This container (torch.nn.parallel.DistributedDataParallel()) parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backwards pass, gradients from each node are averaged.
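Regarding "gradients from each node are averaged": this can be sketched in plain Python with made-up per-GPU gradients (purely illustrative; real DDP does this with an all-reduce over CUDA tensors):

```python
# Conceptual sketch of DDP gradient averaging (hypothetical numbers, no real GPUs).
# Each replica computes gradients on its own shard of the batch; DDP then
# all-reduces them so every replica ends up with the same mean gradient.

def allreduce_mean(per_replica_grads):
    """Average gradients element-wise across replicas, as DDP's all-reduce does."""
    n_replicas = len(per_replica_grads)
    n_params = len(per_replica_grads[0])
    return [sum(g[i] for g in per_replica_grads) / n_replicas
            for i in range(n_params)]

# Gradients of the same 3 parameters, computed independently on 2 GPUs:
grads_gpu0 = [0.25, -0.5, 1.0]
grads_gpu1 = [0.75, 0.0, 0.5]

avg = allreduce_mean([grads_gpu0, grads_gpu1])
print(avg)  # [0.5, -0.25, 0.75] -- every GPU applies this same averaged gradient
```

Because every process applies the same averaged gradient, the model replicas stay in sync without a central parameter server.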
Data Parallel (DP) and Distributed Data Parallel (DDP) training in Pytorch and fastai v2
For training a Deep Learning model in parallel using PyTorch or fastai v2, there are two modes: Data Parallel (DP) and Distributed Data Parallel (DDP), but you should use DDP instead of DP (see below for the explanation).
1. Pytorch | How to train a model across multi-GPUs?
Pytorch | Data Parallel (DP)
nn.DataParallel (DP) performs one process on multiple devices of a single machine. As an example, it can perform the training of your Deep Learning model (which is one process) by distributing it across many GPUs of a single machine (a GPU is a device).
- How? By distributing the batches of the training and validation dataloaders across the available GPUs. This is data parallelism at the module level.
- Positive: the batch size can be bigger, as batches are split equally across all GPUs.
- Negative: the single process is a bottleneck that can increase processing time.
This is the official definition from PyTorch:
- This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension (other objects will be copied once per device). In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backwards pass, gradients from each replica are summed into the original module.
- The batch size should be larger than the number of GPUs used.
- WARNING: It is recommended to use DistributedDataParallel, instead of this class, to do multi-GPU training, even if there is only a single node. See: Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel and Distributed Data Parallel.
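The "chunking in the batch dimension" can be sketched without any GPU: DP slices the input along dimension 0 into one near-equal chunk per device (this shows only the splitting logic, not PyTorch's actual scatter/gather implementation):

```python
# Sketch of how DataParallel chunks a batch across devices (illustrative only;
# real DP scatters CUDA tensors and gathers the per-device outputs).

def chunk_batch(batch, n_devices):
    """Split a batch (list of samples) into near-equal chunks, one per device."""
    chunk_size = -(-len(batch) // n_devices)  # ceiling division
    return [batch[i:i + chunk_size] for i in range(0, len(batch), chunk_size)]

batch = list(range(10))          # a batch of 10 samples
chunks = chunk_batch(batch, 4)   # pretend we have 4 GPUs
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

This also shows why the batch size should be larger than the number of GPUs: with fewer samples than devices, some devices would receive no work at all.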
PyTorch code
You will find the full code of the notebook 05_pet_breeds.ipynb using Data Parallel PyTorch code in my notebook 05_pet_breeds_DataParallel.ipynb (nbviewer version).
The key lines are the following:
if torch.cuda.device_count() > 1:
learn.model = nn.DataParallel(learn.model)
PyTorch forum
- About Data Parallel and DataParallel
Pytorch | Distributed Data Parallel (DDP)
nn.parallel.DistributedDataParallel (DDP) is useful when you want to run multiple processes on the devices of multiple machines, but you can use it on the devices of a single machine as well. Unlike DataParallel, with DDP each device (GPU) independently runs one copy of the process on a part of the training dataset (this is true process and data parallelism).
This is the official definition from PyTorch:
- DDP implements distributed data parallelism that is based on the torch.distributed package at the module level.
- This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backwards pass, gradients from each node are averaged.
- The batch size should be larger than the number of GPUs used locally.
- See also: Basics and Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel. The same constraints on input as in torch.nn.DataParallel apply.
- Creation of this class requires that torch.distributed be already initialized, by calling torch.distributed.init_process_group().
- DistributedDataParallel is proven to be significantly faster than torch.nn.DataParallel for single-node multi-GPU data parallel training.
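The initialization requirement can be sketched as a single-process CPU example with the gloo backend. The address, port, rank/world_size values and the Linear model are all placeholders; with a real launcher such as torch.distributed.launch, each process gets its own rank and the world size from the environment:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# DDP requires torch.distributed to be initialized first.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder address
os.environ.setdefault("MASTER_PORT", "29500")      # placeholder port
dist.init_process_group("gloo", rank=0, world_size=1)  # one process, for the sketch

model = nn.Linear(10, 2)   # placeholder model
ddp_model = DDP(model)     # would raise if init_process_group had not been called

out = ddp_model(torch.randn(4, 10))
print(out.shape)  # torch.Size([4, 2])
```

On GPUs you would instead use the nccl backend and pass device_ids=[local_rank] to DDP, one process per GPU.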
PyTorch forum
PyTorch code
- ImageNet training in PyTorch: this implements training (file main.py) of popular model architectures, such as ResNet, AlexNet, and VGG, on the ImageNet dataset with multi-processing Distributed Data Parallel training:
PyTorch tutorials
- Distributed data parallel training in Pytorch
- Motivation: The easiest way to speed up neural network training is to use a GPU, which provides large speedups over CPUs on the types of calculations (matrix multiplies and additions) that are common in neural networks. As the model or dataset gets bigger, one GPU quickly becomes insufficient. For example, big language models such as BERT and GPT-2 are trained on hundreds of GPUs. To do multi-GPU training, we must have a way to split the model and data between the different GPUs and to coordinate the training.
- Why distributed data parallel?: I like to implement my models in Pytorch because I find it has the best balance between control and ease of use of the major neural-net frameworks. Pytorch has two ways to split models and data across multiple GPUs: nn.DataParallel and nn.DistributedDataParallel. nn.DataParallel is easier to use (just wrap the model and run your training script). However, because it uses one process to compute the model weights and then distributes them to each GPU during each batch, networking quickly becomes a bottleneck and GPU utilization is often very low. Furthermore, nn.DataParallel requires all the GPUs to be on the same node and doesn't work with Apex for mixed-precision training.
- PyTorch Distributed Training
- PyTorch has a relatively simple interface for distributed training. To do distributed training, the model just has to be wrapped using DistributedDataParallel and the training script launched using torch.distributed.launch. Although PyTorch has offered a series of tutorials on distributed training, I found them insufficient or overwhelming for beginners wanting to do state-of-the-art PyTorch distributed training. Some key details were missing, and the usage of Docker containers in distributed training was not mentioned at all.
- In this blog post, I would like to present a simple implementation of PyTorch distributed training on CIFAR-10 classification using DistributedDataParallel-wrapped ResNet models. The usage of Docker containers for distributed training and how to start distributed training using torch.distributed.launch will also be covered.
- Getting Started with Distributed Data Parallel
- DistributedDataParallel (DDP) implements data parallelism at the module level which can run across multiple machines. Applications using DDP should spawn multiple processes and create a single DDP instance per process. DDP uses collective communications in the torch.distributed package to synchronize gradients and buffers. More specifically, DDP registers an autograd hook for each parameter given by model.parameters(), and the hook fires when the corresponding gradient is computed in the backward pass. DDP then uses that signal to trigger gradient synchronization across processes. Please refer to the DDP design note for more details.
- The recommended way to use DDP is to spawn one process for each model replica, where a model replica can span multiple devices. DDP processes can be placed on the same machine or across machines, but GPU devices cannot be shared across processes. This tutorial starts from a basic DDP use case and then demonstrates more advanced use cases including checkpointing models and combining DDP with model parallel.
- Distributed Data Parallel: torch.nn.parallel.DistributedDataParallel (DDP) transparently performs distributed data parallel training. This page describes how it works and reveals implementation details.
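The per-parameter autograd hook mechanism mentioned above can be seen in miniature with a plain tensor. Here the hook just records the gradient instead of triggering DDP's all-reduce (a simplified sketch, not DDP's actual code):

```python
import torch

recorded = []

# A 'parameter' with a hook, as DDP registers one per model parameter.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w.register_hook(lambda grad: recorded.append(grad.clone()))

loss = (2 * w).sum()   # d(loss)/dw = [2, 2, 2]
loss.backward()        # the hook fires when w's gradient is computed

print(recorded[0])  # tensor([2., 2., 2.]) -- DDP would launch an all-reduce here
```

Because hooks fire per parameter as soon as each gradient is ready, DDP can overlap gradient communication with the rest of the backward pass.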
2. fastai v2 | How to train a model across multi-GPUs?
fastai v2 implements both modes, as documented in the Distributed and parallel training fastai v2 doc.
fastai v2 | Data Parallel (DP)
This is the simplest one. Use the class functions of the ParallelTrainer class.
You will find the full code of the notebook 05_pet_breeds.ipynb using Data Parallel fastai v2 code in my notebook 05_pet_breeds_DataParallel.ipynb (nbviewer version).
The key lines are the following:
ctx = learn.parallel_ctx
with partial(ctx, gpu)():
print(f"Training in {ctx.__name__} context on GPU {list(range(n_gpu))}")
learn.fine_tune(2)
fastai v2 | Distributed Data Parallel (DDP)
This is the fastest one, as it truly trains all the model copies in parallel (one model copy per GPU). Use the class functions of the DistributedTrainer class.
As you will need to launch at least 2 processes in parallel, this cannot be done in a Jupyter notebook; it must be done from a terminal, via the following command run at the path of script.py within a fastai v2 virtual environment:
python -m fastai2.launch script.py
Here, fastai2.launch refers to launch.py, and script.py contains the DDP-formatted training code, like the following files in fastai > fastai2 > nbs > examples on github:
- train_imagenette.py (images classification)
- train_imdbclassifier.py (texts classification)
- train_tabular.py (regression)
Notebooks on DDP from fastai v2
I ran the following commands at the path of the .py files, in a terminal of a server with 2 NVIDIA V100 32GB GPUs, within a fastai v2 virtual environment.
1. Images Classification with ImageNette
File: train_imagenette.py
python -m fastai2.launch train_imagenette.py
In order to test the train_imagenette.py code with another dataset, I created the file 05_pet_breeds_DDP.py from the notebook 05_pet_breeds.ipynb. It ran perfectly with DDP (Distributed Data Parallel) and fastai v2, dividing the training and validation time by 3, using the following command:
python -m fastai2.launch 05_pet_breeds_DDP.py
Compare with the training and validation times in the notebook 05_pet_breeds.ipynb (just one GPU)…
2. Texts Classification with IMDB
File: train_imdbclassifier-Copy1.py (I needed to set DistributedTrainer.fup = True in the original file train_imdbclassifier.py)
python -m fastai2.launch train_imdbclassifier-Copy1.py
3. Tabular Classification with ADULTS
File: train_tabular.py
python -m fastai2.launch train_tabular.py
Note: when running python -m fastai2.launch script.py, if you get the error store = TCPStore(master_addr, master_port, start_daemon) RuntimeError: Address already in use in the terminal, just run the command ps -elf | grep python to get the PIDs of the zombie Python processes. Then you can kill them by PID (e.g. kill -9 14275) or by file name (e.g. pkill -9 -f script.py).
fastai v2 | Problems not solved with DDP
I wanted to run the Transformers tutorial in DDP using the code from train_imdbclassifier.py, but it does not work (see this post).
Sorry, maybe my question wasn't clear. The gradients for updating the model are averaged. But what about the actual loss and metrics?
For example, let's say you finish an epoch. Then you have a validation set. It's divided among the GPUs. So are the loss and metrics calculated on each GPU and then averaged? Or is the model applied on each GPU, the predictions gathered from all the GPUs, and then the loss and metrics calculated on the whole dataset at once?
In PyTorch you can implement it either way. Huggingface implements it by averaging but apparently they claim that you cannot trust those metrics (see here). Do you know what way fastai2 implements it?
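To see why the two implementations can disagree, here is a plain-Python sketch with made-up per-sample losses: averaging the per-GPU means differs from the whole-dataset mean whenever the shards have different sizes (as can happen with drop-last or padded samplers):

```python
# Hypothetical per-sample validation losses on each GPU's shard.
shard_gpu0 = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # 6 samples
shard_gpu1 = [3.0, 3.0]                      # 2 samples

# Option A: each GPU averages its own shard, then the per-GPU means are averaged.
mean_of_means = (sum(shard_gpu0) / len(shard_gpu0)
                 + sum(shard_gpu1) / len(shard_gpu1)) / 2

# Option B: gather everything and compute one mean over the whole validation set.
whole = shard_gpu0 + shard_gpu1
global_mean = sum(whole) / len(whole)

print(mean_of_means)  # 2.0
print(global_mean)    # 1.5  -- the two disagree when shard sizes differ
```

With equal shard sizes the two options coincide for a mean-based loss, but non-decomposable metrics (e.g. AUC, F1) only give the exact value under option B.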
A paper on PyTorch Distributed published by the team:
Maybe some unanswered questions are answered here…
fastai v2 and Transformers | Problems not solved with DDP
I wanted to run Sylvain's Transformers tutorial in DDP, using the code of train_imdbclassifier.py.
To do this, I created the script 39_tutorial.transformers_DDP.py, which I ran with the following command in the same environment (server with 2 NVIDIA V100 32GB GPUs, within a fastai v2 virtual environment) as the one used for my (successful) tests with the fastai v2 scripts (see this post):
python -m fastai2.launch 39_tutorial.transformers_DDP.py
However, it did not work.
@ilovescience, @morgan, @wgpubs, @muellerzr, @sgugger: if you have an idea about it, you are welcome to post it. Thank you in advance.
Versions of frameworks: transformers==3.0.0 | fastai2==0.0.17
(fastai2) pierre@tesla:~/fastai2/nbs$ python -m fastai2.launch 39_tutorial.transformers_DDP.py
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Rank[0] Run: 0; epochs: 1; lr: 0.0001; bs: 8; sl: 1024
Rank[1] Run: 0; epochs: 1; lr: 0.0001; bs: 8; sl: 1024
Training in distributed data parallel context on GPU 1
Training in distributed data parallel context on GPU 0
epoch train_loss valid_loss perplexity time
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
File "39_tutorial.transformers_DDP.py", line 66, in <module>
runs: Param("Number of times to repeat training", int)=1,
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 76, in call_parse
return _f()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 73, in _f
func(**args.__dict__)
File "39_tutorial.transformers_DDP.py", line 126, in main
learn.fit_one_cycle(epochs, lr)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
return inst if to_return else f(*args, **kwargs)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/callback/schedule.py", line 113, in fit_one_cycle
self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
return inst if to_return else f(*args, **kwargs)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 200, in fit
self._do_epoch_train()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 175, in _do_epoch_train
self.all_batches()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 153, in all_batches
for o in enumerate(self.dl): self.one_batch(*o)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 98, in __iter__
for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
data = next(self.dataset_iter)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 107, in create_batches
yield from map(self.do_batch, self.chunkify(res))
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 128, in do_batch
def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 127, in create_batch
def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 45, in fa_collate
return (default_collate(t) if isinstance(b, _collate_types)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [3029] at entry 0 and [4514] at entry 1
0 nan 00:00
^CTraceback (most recent call last):
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/distributed.py", line 166, in distrib_ctx
yield self
File "39_tutorial.transformers_DDP.py", line 126, in main
learn.fit_one_cycle(epochs, lr)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
return inst if to_return else f(*args, **kwargs)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/callback/schedule.py", line 113, in fit_one_cycle
self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
return inst if to_return else f(*args, **kwargs)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 200, in fit
self._do_epoch_train()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 175, in _do_epoch_train
self.all_batches()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 153, in all_batches
for o in enumerate(self.dl): self.one_batch(*o)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 98, in __iter__
for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
data = next(self.dataset_iter)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 107, in create_batches
yield from map(self.do_batch, self.chunkify(res))
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 128, in do_batch
def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 127, in create_batch
def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 45, in fa_collate
return (default_collate(t) if isinstance(b, _collate_types)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [4382] at entry 0 and [4065] at entry 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "39_tutorial.transformers_DDP.py", line 66, in <module>
runs: Param("Number of times to repeat training", int)=1,
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 76, in call_parse
return _f()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 73, in _f
func(**args.__dict__)
File "39_tutorial.transformers_DDP.py", line 126, in main
learn.fit_one_cycle(epochs, lr)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/distributed.py", line 169, in distrib_ctx
if cleanup_dpg: teardown_distrib()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/distributed.py", line 65, in teardown_distrib
if torch.distributed.is_initialized(): torch.distributed.destroy_process_group()
KeyboardInterrupt
^CTraceback (most recent call last):
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1019, in wait
return self._wait(timeout=timeout)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1653, in _wait
(pid, sts) = self._try_wait(0)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1611, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/launch.py", line 9, in <module>
args:Param("Args to pass to script", nargs='...', opt=False)=''
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 76, in call_parse
return _f()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 73, in _f
func(**args.__dict__)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/launch.py", line 26, in main
for process in processes: process.wait()
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1032, in wait
self._wait(timeout=sigint_timeout)
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1647, in _wait
time.sleep(delay)
KeyboardInterrupt
(fastai2) pierre@tesla:~/fastai2/nbs$
I wish I could help, but huggingface v.3 has currently broken all my transformer code
I can tell you're running v.3 from the warning messages above… are you sure the problem is with fastai v2 and not rather with v.3? Just curious whether this runs fine on a single GPU with the latest version of huggingface… and if not, I'd start there.
-wg
Sorry, I haven't done any distributed work before
Also afraid to peek at v3
Sorry!
Thanks for your message @wgpubs, but the problem is independent of the Transformers version. In fact, it does not come from Transformers v3: apart from the warning, it was the same problem with 2.11.0 (I updated today from 2.11.0 to 3.0.0).
And the Transformers tutorial of Sylvain works perfectly well with Transformers v3 on one GPU (at least on my server).
I think the problem is mainly here :
File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [4382] at entry 0 and [4065] at entry 1
My understanding is that the training and validation datasets are distributed to the 2 processes (one per GPU), not the batches (the sequence length of one batch in the Dataloaders is 1024). Then, the batches are created on each GPU, but without respecting the sequence length of 1024. As the datasets are a concatenation of texts of different lengths, torch.stack() cannot process them.
The question is: why are the Dataloaders not used at the process level when the mode is DDP in fastai v2?
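A minimal sketch of the kind of padding collate that would let torch.stack succeed on variable-length texts (the pad token id of 0 is an assumption, and a real fix would have to hook into fastai's DataLoader/before_batch rather than replace it):

```python
import torch

def pad_collate(batch, pad_id: int = 0):
    """Pad 1-D token tensors to the longest one in the batch, then stack them."""
    max_len = max(t.size(0) for t in batch)
    padded = [torch.cat([t, t.new_full((max_len - t.size(0),), pad_id)])
              for t in batch]
    return torch.stack(padded)

# Two 'texts' of different lengths -- torch.stack alone would fail on these,
# which is exactly the RuntimeError in the traceback above.
batch = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]
out = pad_collate(batch)
print(out)  # tensor([[5, 6, 7], [8, 9, 0]])
```

For a GPT-2-style language model, the pad positions would also need to be masked out of the loss, so this is only the collation half of a fix.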
To read about training with multiple GPUs!
Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups (Thomas Wolf - Hugging Face, Oct 15, 2018)
@pierreguillou, I’ve noticed in the distributed/parallel fastai docs (https://docs.fast.ai/distributed.html), there is a section for distributed dataloader.
In the parallel notebook: https://github.com/piegu/fastai-projects/blob/master/05_pet_breeds_DataParallel.ipynb
There is no such distributed dataloader. Do we need to write a distributed/parallel dataloader as well?
Parallel works out of the box, but I’m running into issues with distributed.
First, it appears that learn.summary() is not compatible with distributed training. You get an "AssertionError: Default process group is not initialized" error, which goes away when I comment out that line.
But then it gets stuck on the first epoch and never trains:
Training in distrib_ctx context on GPU 1
Training in distrib_ctx context on GPU 0
epoch train_loss valid_loss time
Epoch 1/2 : |----------------------------------------------------------------------------------------------------| 0.00% [0/90 00:00<00:00]
I’m also using a custom loss function and custom dataloader… does that need to be modified too?
Ran into more issues. This time with parallel training.
I have the exact same code with the exact same container; it works fine on machine A and crashes out on machine B.
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
After much hacking, I found that it was learn.to_fp16() that was causing the issue! It looks like fp16 training does not sit well with parallel training. Some googling led to hints that the weights were not distributed across all the GPUs correctly, and that it's related to the naming of the GPU device IDs. Does anyone know how to troubleshoot this?
Would you like to open a new issue on the fastai2 repo, with instructions on how to reproduce this error? I can take a look later (I wrote the distrib_ctx thingie in fastai v2 and the assertion looks familiar).
Thanks.
Phil
@pierreguillou any update on the error? I found the same problem when I try to distribute the transformer
thanks, I think the problem is in the dataloader; however, I do not know how to implement it
Did you ever fix your specific attribute error 'Learner' object has no attribute 'distrib_ctx'? I have the exact same issue, where only torch.nn.DataParallel(learner.model) works.
I had the same issue and resolved it by adding from fastai.distributed import *. Also remember to launch your training script using python -m fastai.launch train.py
The distributed example https://github.com/fastai/fastai/blob/master/nbs/examples/distrib.py is useful for pointing out details missed in the docs.
Thank you. This was my issue too, and now it's working!