Distributed training stuck when using multiple GPUs

I am trying to reproduce the tutorial that is in the docs on a machine with 2 GPUs.

I have saved the following (which is copied from the docs)

from fastai.vision import *
from fastai.vision.models.wrn import wrn_22
from fastai.distributed import *
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

path = untar_data(URLs.CIFAR)
ds_tfms = ([*rand_pad(4, 32), flip_lr(p=0.5)], [])
data = ImageDataBunch.from_folder(path, valid='test', ds_tfms=ds_tfms, bs=128).normalize(cifar_stats)
learn = Learner(data, wrn_22(), metrics=accuracy).to_distributed(args.local_rank)
learn.fit_one_cycle(2, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)

in a test.py script.

If I launch python -m torch.distributed.launch --nproc_per_node=1 test.py, everything works normally. But if I try python -m torch.distributed.launch --nproc_per_node=2 test.py, the two processes are launched and I can see the GPU memory being used, but training never starts: the progress bar simply stays at 0%.
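To narrow this down, here is a minimal sketch (the file name check_dist.py is my own, not from the docs) that checks whether the NCCL process group itself can communicate across the two GPUs, independently of fastai:

# check_dist.py - standalone sanity check for the NCCL / launch setup
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

# every process contributes a tensor of ones; after all_reduce each
# process should see the value world_size
t = torch.ones(1).cuda()
dist.all_reduce(t)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} sees {t.item()}")

If python -m torch.distributed.launch --nproc_per_node=2 check_dist.py also hangs (both processes should print 2.0), the problem is in the NCCL/launch setup rather than in fastai; setting the environment variable NCCL_DEBUG=INFO can then show what NCCL is doing.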

The issue above seems to have been solved, but now when I try the same thing with a TransformerXL, training gets stuck at validation: training goes fine, but validation never starts.


@miko were you able to resolve it? I am getting the same issue with to_distributed. Data parallel isn’t giving me any speedups.

I am sorry, I have not looked at this for quite a while. A lack of speedup is quite common: make sure you increase the batch size and the learning rate accordingly, as that is where most of the speedup will come from.
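For reference, a rough sketch of that scaling, continuing from the test.py script above (the linear learning-rate scaling and the 2-GPU numbers are my assumptions for this thread, not something prescribed by the docs):

n_gpus = 2                    # assumption: the 2-GPU machine from this thread
per_gpu_bs = 128              # bs passed to each process's DataBunch
base_lr = 3e-3                # learning rate tuned for a single GPU

effective_bs = per_gpu_bs * n_gpus   # 256 samples contribute to each optimizer step
scaled_lr = base_lr * n_gpus         # linear scaling heuristic -> 6e-3

learn.fit_one_cycle(2, scaled_lr, wd=0.4, div_factor=10, pct_start=0.5)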

@miko got it, and thank you. What I meant to say is that to_distributed() gives a ~2x speedup from the 2 GPUs, whereas data parallel doesn't. In my case it was also hanging on the validation set. I was under the impression, however, that we should keep the per-GPU batch size the same (so the effective batch size is batch size * #GPUs).
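For context, the single-process data-parallel path looks roughly like this in plain PyTorch (a sketch with nn.DataParallel and a stand-in model, not the thread's actual code), which is part of why it tends to give smaller speedups than to_distributed():

import torch
import torch.nn as nn

# stand-in model; a real run would use wrn_22() from the script above
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).cuda()

# nn.DataParallel keeps a single Python process and splits each batch of
# size bs across the visible GPUs, so every step pays for replicating the
# model and scattering/gathering tensors between devices
model = nn.DataParallel(model)

x = torch.randn(128, 3, 32, 32).cuda()   # one CIFAR-sized batch
out = model(x)                           # forward pass split across the GPUs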

I will debug this a bit more today, but the stack trace basically showed it hanging at return self._wait(timeout=timeout).
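One thing worth trying when it hangs in a wait like that (only a guess on my side, since that frame could come from several places) is to rule out DataLoader worker deadlocks by keeping data loading in the main process, e.g. by changing the line in test.py to:

# num_workers=0 keeps all data loading in the main process,
# which rules out deadlocks in the worker processes
data = ImageDataBunch.from_folder(path, valid='test', ds_tfms=ds_tfms,
                                  bs=128, num_workers=0).normalize(cifar_stats)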

@miko did you manage to resolve the distributed training on multiple GPUs? I am having the same issue: training gets stuck on the first epoch, with all 4 GPUs showing 100% utilization. I'm using the following:
TensorFlow 2.2 on CUDA 10.1