Distributed and parallel training... explained

Sorry, maybe my question wasn’t clear. The gradients for updating the model are averaged, but what about the actual loss and metrics?

For example, let’s say you finish an epoch and you have a validation set divided among the GPUs. Are the loss and metrics calculated on each GPU and then averaged? Or is the model applied on each GPU, the predictions gathered from all the GPUs, and then the loss and metrics calculated on the whole dataset at once?

In PyTorch you can implement it either way. Huggingface implements it by averaging, but apparently they claim that you cannot trust those metrics (see here). Do you know which way fastai2 implements it?
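
To make the two options concrete, here is a rough PyTorch sketch of both strategies (illustrative only; it assumes torch.distributed is already initialized and the tensors live on the right devices):

import torch
import torch.distributed as dist

# Option 1: compute the loss/metric per GPU, then average the scalar across processes.
def averaged_metric(local_metric: torch.Tensor) -> torch.Tensor:
    m = local_metric.clone()
    dist.all_reduce(m, op=dist.ReduceOp.SUM)
    return m / dist.get_world_size()

# Option 2: gather all predictions and targets, then compute the metric once on the full set.
# (all_gather requires same-sized tensors on every rank, which is its own constraint.)
def gathered_metric(preds, targs, metric_fn):
    world = dist.get_world_size()
    all_preds = [torch.zeros_like(preds) for _ in range(world)]
    all_targs = [torch.zeros_like(targs) for _ in range(world)]
    dist.all_gather(all_preds, preds)
    dist.all_gather(all_targs, targs)
    return metric_fn(torch.cat(all_preds), torch.cat(all_targs))

For a metric like accuracy the two approaches agree when every GPU sees the same number of samples; for something like perplexity or F1 they can differ, which is presumably why the averaged version is viewed with suspicion.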

A paper on PyTorch Distributed published by the team:

Maybe some unanswered questions are answered here…


fastai v2 and Transformers | Problems not solved with DDP

I wanted to run Sylvain’s Transformers tutorial in DDP, using the code of train_imdbclassifier.py.

To do this, I created the script 39_tutorial.transformers_DDP.py and ran it with the following command, in the same environment (a server with 2 NVIDIA V100 32GB GPUs, inside a fastai v2 virtual environment) as the one used for my (successful) tests with the fastai v2 scripts (see this post):

python -m fastai2.launch 39_tutorial.transformers_DDP.py

However, it did not work.
@ilovescience, @morgan, @wgpubs, @muellerzr, @sgugger: if you have an idea about it, you are welcome to post it. Thank you in advance.

Versions of frameworks: transformers==3.0.0 | fastai2==0.0.17

(fastai2) pierre@tesla:~/fastai2/nbs$ python -m fastai2.launch 39_tutorial.transformers_DDP.py

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Rank[0] Run: 0; epochs: 1; lr: 0.0001; bs: 8; sl: 1024
Rank[1] Run: 0; epochs: 1; lr: 0.0001; bs: 8; sl: 1024
Training in distributed data parallel context on GPU 1
Training in distributed data parallel context on GPU 0
epoch     train_loss  valid_loss  perplexity  time
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
[... the two warning lines above are repeated many more times ...]
Traceback (most recent call last):
  File "39_tutorial.transformers_DDP.py", line 66, in <module>
    runs:  Param("Number of times to repeat training", int)=1,
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 76, in call_parse
    return _f()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 73, in _f
    func(**args.__dict__)
  File "39_tutorial.transformers_DDP.py", line 126, in main
    learn.fit_one_cycle(epochs, lr)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/callback/schedule.py", line 113, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 200, in fit
    self._do_epoch_train()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 175, in _do_epoch_train
    self.all_batches()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 153, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 98, in __iter__
    for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data = next(self.dataset_iter)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 107, in create_batches
    yield from map(self.do_batch, self.chunkify(res))
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 128, in do_batch
    def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 127, in create_batch
    def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 45, in fa_collate
    return (default_collate(t) if isinstance(b, _collate_types)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [3029] at entry 0 and [4514] at entry 1

0         nan         00:00
^CTraceback (most recent call last):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/distributed.py", line 166, in distrib_ctx
    yield self
  File "39_tutorial.transformers_DDP.py", line 126, in main
    learn.fit_one_cycle(epochs, lr)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/callback/schedule.py", line 113, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastcore/utils.py", line 431, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 200, in fit
    self._do_epoch_train()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 175, in _do_epoch_train
    self.all_batches()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py", line 153, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 98, in __iter__
    for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data = next(self.dataset_iter)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 107, in create_batches
    yield from map(self.do_batch, self.chunkify(res))
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 128, in do_batch
    def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 127, in create_batch
    def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/data/load.py", line 45, in fa_collate
    return (default_collate(t) if isinstance(b, _collate_types)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [4382] at entry 0 and [4065] at entry 1


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "39_tutorial.transformers_DDP.py", line 66, in <module>
    runs:  Param("Number of times to repeat training", int)=1,
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 76, in call_parse
    return _f()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 73, in _f
    func(**args.__dict__)
  File "39_tutorial.transformers_DDP.py", line 126, in main
    learn.fit_one_cycle(epochs, lr)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/distributed.py", line 169, in distrib_ctx
    if cleanup_dpg: teardown_distrib()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/distributed.py", line 65, in teardown_distrib
    if torch.distributed.is_initialized(): torch.distributed.destroy_process_group()
KeyboardInterrupt
^CTraceback (most recent call last):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1653, in _wait
    (pid, sts) = self._try_wait(0)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1611, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/launch.py", line 9, in <module>
    args:Param("Args to pass to script", nargs='...', opt=False)=''
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 76, in call_parse
    return _f()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastscript/core.py", line 73, in _f
    func(**args.__dict__)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/fastai2/launch.py", line 26, in main
    for process in processes: process.wait()
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1032, in wait
    self._wait(timeout=sigint_timeout)
  File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/subprocess.py", line 1647, in _wait
    time.sleep(delay)
KeyboardInterrupt
(fastai2) pierre@tesla:~/fastai2/nbs$

I wish I could help, but huggingface v.3 has currently broken all my transformer code :slight_smile:

I can tell you’re running v.3 from the warning messages above … are you sure the problem isn’t with v.3 rather than with fastai v2? Just curious whether this runs fine on a single GPU with the latest version of huggingface … and if not, I’d start there.

-wg


Sorry I haven’t done any distributed work before

Also afraid to peek at v3 :sweat_smile:

Sorry!


Thanks for your message @wgpubs, but the problem is independent of the Transformers version. It does not come from Transformers v3: warning messages aside, I had the same problem with 2.11.0 (I updated today from 2.11.0 to 3.0.0).

And Sylvain’s Transformers tutorial works perfectly well with Transformers v3 on one GPU (at least on my server).

I think the problem is mainly here:

File "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [4382] at entry 0 and [4065] at entry 1

My understanding is that the training and validation datasets are distributed to the 2 processes (one per GPU), not the batches (the sequence length of one batch in the Dataloaders is 1024). Then the batches are created on each GPU, but without respecting that sequence length of 1024. As the datasets are a concatenation of texts of different lengths, torch.stack() cannot process them.

The question is: why isn’t the Dataloaders’ batching applied at the process level when the mode is DDP in fastai v2?
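
As a stopgap while this is sorted out, one could pad every batch to a common length before collation so that torch.stack succeeds. This is only a rough sketch, not code from the tutorial: the (input, target) tuple structure and pad_token_id are assumptions.

import torch

# Hypothetical before_batch transform: pad each (input, target) pair in the batch
# to the length of the longest sequence so the default collation can stack them.
def pad_batch(samples, pad_token_id=0):
    max_len = max(x.size(0) for x, _ in samples)
    def pad(t):
        out = t.new_full((max_len,), pad_token_id)
        out[:t.size(0)] = t
        return out
    return [(pad(x), pad(y)) for x, y in samples]

# Assumed usage: pass it when building the DataLoaders, e.g.
# dls = tls.dataloaders(bs=8, before_batch=pad_batch)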


Something to read about training with multiple GPUs:

Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups (Thomas Wolf - Hugging Face, Oct 15, 2018)

@pierreguillou, I’ve noticed that in the distributed/parallel fastai docs (https://docs.fast.ai/distributed.html) there is a section on the distributed dataloader.

In the parallel notebook: https://github.com/piegu/fastai-projects/blob/master/05_pet_breeds_DataParallel.ipynb

There is no such distributed dataloader. Do we need to write a distributed/parallel dataloader as well?

Parallel works out of the box, but I’m running into issues with distributed.

First, it appears that “learn.summary()” is not compatible with distributed training. You get an “AssertionError: Default process group is not initialized” error, which went away when I commented out that line.

But then it gets stuck on the first epoch and never trains:

Training in distrib_ctx context on GPU 1
Training in distrib_ctx context on GPU 0
epoch     train_loss  valid_loss  time
Epoch 1/2 : |----------------------------------------------------------------------------------------------------| 0.00% [0/90 00:00<00:00]

I’m also using a custom loss function and a custom dataloader… do those need to be modified too?

Ran into more issues. This time with parallel training.

I have the exact same code in the exact same container, and it works fine on machine A but crashes out on machine B.

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

After much hacking, I found that it was learn.to_fp16() that was causing the issue! It looks like fp16 training does not sit well with parallel training. Some googling led to hints that the weights were not distributed across all the GPUs correctly, and that it’s related to the naming of the GPU device IDs? Does anyone know how to troubleshoot this?
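
The only sanity check I can think of so far (a guess, not a confirmed fix): make the GPU ordering explicit and look at which devices the parameters actually sit on before wrapping the model.

import os
import torch

# Make device numbering match nvidia-smi (must be set before CUDA is initialized;
# assumption: the mismatch comes from device ordering).
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

def report_param_devices(model):
    # With DataParallel, all parameters should live on the master device before replication.
    print({str(p.device) for p in model.parameters()})

# report_param_devices(learn.model)  # expected: {'cuda:0'}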

Would you like to open a new issue on the fastai2 repo, with instructions on how to reproduce this error? I can take a look later (I wrote the distrib_ctx thingie in fastai v2 and the assertion looks familiar :wink: )

Thanks.

Phil

@pierreguillou any update on the error? I ran into the same problem when I tried to distribute the transformer.

Hello @neuralconcept. Sorry, but I did not try again and I did not receive a solution to my post.

Thanks. I think the problem is in the dataloader; however, I do not know how to implement the fix.

Did you ever fix your specific AttributeError: 'Learner' object has no attribute 'distrib_ctx'?
I have the exact same issue, where only torch.nn.DataParallel(learner.model) works.


I had the same issue and resolved it by adding from fastai.distributed import *. Also remember to launch your training script using python -m fastai.launch train.py.

The distributed example https://github.com/fastai/fastai/blob/master/nbs/examples/distrib.py is useful for pointing out details missed in the docs.
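
For anyone landing here later, here is a minimal sketch of what that setup looks like end to end (my own rough example on PETS, not a copy of distrib.py; the dataset, architecture and batch size are arbitrary):

from fastai.vision.all import *
from fastai.distributed import *   # brings Learner.distrib_ctx into scope

# Save as train.py and run with: python -m fastai.launch train.py
path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path/"images"), pat=r"(.+)_\d+.jpg",
    item_tfms=Resize(224), bs=64)
learn = vision_learner(dls, resnet34, metrics=error_rate)

with learn.distrib_ctx():   # wraps the model in DDP and shards the data across GPUs
    learn.fit_one_cycle(1)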


Thank you. This was my issue too and now it’s working!

I still have problems with nn.DataParallel(learn.model).

RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:1 and hidden tensor at cuda:0

This doesn’t make sense to me, because I am trying to run it across 8 GPUs.

I have come to two possibilities.

  1. My data loader is mismatched with the learner and I need to fix the devices; pierreguillou’s examples don’t seem to work for me (a sketch of what I mean is below, after this list).
  2. It’s just impossible right now. Fastai v2 text - #431 by chess
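
Sketch for possibility 1, assuming the model is an RNN (hypothetical code, not my actual model): a common cause of this error under DataParallel is a hidden state created on a fixed device instead of on the device of the scattered input.

import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    def __init__(self, n_in=10, n_hidden=20, n_layers=2):
        super().__init__()
        self.n_hidden, self.n_layers = n_hidden, n_layers
        self.rnn = nn.GRU(n_in, n_hidden, n_layers, batch_first=True)

    def forward(self, x):
        # Create the hidden state on x.device, not on a hard-coded cuda:0,
        # so each DataParallel replica gets a hidden tensor on its own GPU.
        h0 = torch.zeros(self.n_layers, x.size(0), self.n_hidden, device=x.device)
        out, _ = self.rnn(x, h0)
        return out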



I am also running into this same issue on SageMaker Studio at the moment, on a multi-GPU instance. Did you manage to find a fix for this?

I have code which runs on all four of my GPUs (I can see it in nvidia-smi), but each epoch takes longer on 4 GPUs than on one!

If I run without specifying DataParallel:

learn = vision_learner(dls, resnet152, metrics=error_rate)
learn.fit_one_cycle(n_epoch=1)

epoch	train_loss	valid_loss	error_rate	time
0	0.018546	0.014250	0.003585	01:19

With DataParallel:

learn = vision_learner(dls, resnet152, metrics=error_rate)
learn.model = torch.nn.DataParallel(learn.model) # send it to all gpus
learn.fit_one_cycle(n_epoch=1)

epoch	train_loss	valid_loss	error_rate	time
0.049150	0.033304	0.007547	01:36

What can I be doing wrong?
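
The only thing I can think of to try next (just a guess): nn.DataParallel splits each batch across the GPUs, so with an unchanged bs each GPU only processes a quarter of a batch and the scatter/gather and model-replication overhead may dominate. A rough sketch of scaling the batch size with the GPU count (the PETS loader here is only a stand-in for however the dls were actually built):

import torch
from fastai.vision.all import *

n_gpus = max(1, torch.cuda.device_count())

# Rebuild the DataLoaders with a proportionally larger batch size (e.g. 64 -> 256 on 4 GPUs).
path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path/"images"), pat=r"(.+)_\d+.jpg",
    item_tfms=Resize(224), bs=64 * n_gpus)

learn = vision_learner(dls, resnet152, metrics=error_rate)
learn.model = torch.nn.DataParallel(learn.model)   # send it to all GPUs
learn.fit_one_cycle(n_epoch=1)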