Distributed training stuck when using multiple GPUs

I am trying to reproduce the tutorial that is in the docs on a machine with 2 GPUs.

I have saved the following (which is copied from the docs)

from fastai.vision import *
from fastai.vision.models.wrn import wrn_22
from fastai.distributed import *
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)  # pin this process to its own GPU; the docs script includes this line, and without it all ranks pile onto GPU 0
torch.distributed.init_process_group(backend='nccl', init_method='env://')

path = untar_data(URLs.CIFAR)
ds_tfms = ([*rand_pad(4, 32), flip_lr(p=0.5)], [])
data = ImageDataBunch.from_folder(path, valid='test', ds_tfms=ds_tfms, bs=128).normalize(cifar_stats)
learn = Learner(data, wrn_22(), metrics=accuracy).to_distributed(args.local_rank)
learn.fit_one_cycle(2, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)

in a test.py script.

If I launch python -m torch.distributed.launch --nproc_per_node=1 test.py, everything works normally. But if I try python -m torch.distributed.launch --nproc_per_node=2 test.py, the two processes launch and I can see GPU memory being used, yet training never starts: the progress bar just stays at 0%.
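For context on what the launcher is doing here: torch.distributed.launch essentially spawns one copy of the script per GPU, each with rank environment variables set, which is what init_method='env://' reads. A stdlib-only sketch of that behavior (my own simplified function, not the real launcher):

```python
import os
import subprocess
import sys

def launch(script, nproc, master_addr="127.0.0.1", master_port="29500"):
    """Sketch of torch.distributed.launch: spawn one process per GPU,
    each with the rank env vars that init_method='env://' reads."""
    procs = []
    for local_rank in range(nproc):
        env = dict(os.environ,
                   MASTER_ADDR=master_addr,
                   MASTER_PORT=master_port,
                   WORLD_SIZE=str(nproc),
                   RANK=str(local_rank),
                   LOCAL_RANK=str(local_rank))
        # Each child also gets --local_rank=N on its command line,
        # which is why the script needs the argparse boilerplate above.
        procs.append(subprocess.Popen(
            [sys.executable, script, f"--local_rank={local_rank}"], env=env))
    return procs
```

If one rank never reaches init_process_group (wrong env, wrong device), the others block forever waiting for it, which looks exactly like a progress bar frozen at 0%.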

The issue above seems to have been solved, but now when I try the same thing with a TransformerXL, training gets stuck at validation: training goes fine, but validation never starts.


@miko were you able to resolve it? I am getting the same issue with to_distributed. Data parallel isn’t giving me any speedups.

I am sorry, I have not looked at this for quite a while. Lack of speedup is quite common, though. Make sure you increase the batch size and learning rate accordingly; that is where most of the speedup will come from.
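To make that scaling advice concrete, here is a small sketch of the linear scaling rule (the function is my own illustration, not fastai API): with N GPUs the effective batch grows N-fold, so the learning rate is scaled by the same factor.

```python
def scaled_hyperparams(base_lr, per_gpu_bs, n_gpus):
    """Linear scaling rule: the effective batch size grows with the
    number of GPUs, so scale the learning rate by the same factor."""
    effective_bs = per_gpu_bs * n_gpus
    scaled_lr = base_lr * n_gpus
    return scaled_lr, effective_bs

# e.g. the tutorial's 3e-3 at bs=128 per GPU, on 2 GPUs
lr, bs = scaled_hyperparams(3e-3, 128, 2)
```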

@miko got it, and thank you. What I meant to say was that to_distributed() added a ~2x speedup thanks to the 2 GPUs, but data parallel doesn't. In my case it was also hanging on the validation set. I was under the impression, however, that we keep the per-GPU batch size the same (so the effective batch size is batch size × #GPUs).

I will debug this a bit more today, but the stack trace basically showed it stuck at return self._wait(timeout=timeout)
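When a worker hangs like that, one stdlib-only trick (an assumption about your setup, nothing fastai-specific) is to register a faulthandler signal at the top of the script, then send kill -USR1 <pid> to the stuck process from another shell to dump every thread's stack without killing it:

```python
import faulthandler
import signal

# On SIGUSR1, dump all thread stacks to this file without terminating
# the process: run `kill -USR1 <pid>` from another terminal while hung.
trace_log = open("worker_stacks.log", "w")
faulthandler.register(signal.SIGUSR1, file=trace_log, all_threads=True)
```

The dump shows which rank is blocked in a collective (e.g. a barrier or all_reduce) while the others have moved on, which is the usual shape of these validation hangs.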

@miko did you manage to resolve distributed training on multiple GPUs? I am having the same issues: training gets stuck on the first epoch, with all 4 GPUs showing 100%. I'm using TensorFlow 2.2 on CUDA 10.1.