DistributedDataParallel init hanging

I am trying to do single-node multi-GPU (4 GPUs) training with DistributedDataParallel using to_distributed():

# environment vars
os.environ['CUDA_VISIBLE_DEVICES'] ='0,1,2,3'
os.environ['MASTER_ADDR'] = ''
os.environ['MASTER_PORT'] = '5444'
os.environ['WORLD_SIZE'] = '4'


The code above hangs, and I believe it is hanging during torch.distributed.init_process_group(backend='nccl', init_method='env://', rank=0).
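One plausible reason for the hang, offered as an assumption rather than a diagnosis: with WORLD_SIZE set to 4, init_process_group waits until four processes have joined, and here only one process (rank 0) is ever started. The sketch below is an analogy built on the standard library only, not torch.distributed itself; threading.Barrier stands in for the rendezvous, and try_rendezvous is a hypothetical name.

```python
import threading

def try_rendezvous(world_size, n_started, timeout=0.5):
    """Return True if all `world_size` parties arrive, False if the wait times out.

    Analogy only: threading.Barrier stands in for the process-group rendezvous.
    """
    barrier = threading.Barrier(world_size)
    results = []

    def worker():
        try:
            barrier.wait(timeout=timeout)   # blocks until world_size parties arrive
            results.append(True)
        except threading.BrokenBarrierError:
            results.append(False)           # timed out: not everyone showed up

    threads = [threading.Thread(target=worker) for _ in range(n_started)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results)

alone = try_rendezvous(world_size=4, n_started=1)   # one process, like the snippet above
full = try_rendezvous(world_size=4, n_started=4)    # four processes, as a launcher would start
print(alone, full)
```

In the analogy the lone worker only gives up because of the timeout; without one it would wait indefinitely, which matches the symptom described here.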

If you’re launching on just one machine, you don’t need to specify those env variables normally, and fastai launch is enough to do everything properly for you.


Yes, it is a single machine with 8 GPUs. That was my initial approach but then I got the following error:

learn = cnn_learner(data=fold_data, base_arch=arch, metrics=[accuracy, auc], 
                    lin_ftrs=[1024,1024], ps=[0.7, 0.7, 0.7],

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Most of the answers about data parallelism in the forums use nn.DataParallel, and I couldn’t find a working solution in the PyTorch forums either.

Then, in response to that error message, I set the following, but it keeps hanging:

os.environ['MASTER_ADDR'] = ''
os.environ['MASTER_PORT'] = '29500'
os.environ['WORLD_SIZE'] = '4'
os.environ['RANK'] = '0'

This error is not fastai related but there might be someone who faced a similar issue.


Does the CIFAR10 example hang as well?


python -m fastai.launch train_cifar.py --gpu=3

It throws a division-by-zero error, since n_gpu is set to 0 by:

def num_distrib():
    "Return the number of processes in distributed training (if applicable)."
    return int(os.environ.get('WORLD_SIZE', 0))
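The division by zero follows from that default: when WORLD_SIZE is not in the environment, num_distrib() returns 0, and anything that divides by it fails. A minimal sketch, assuming a hypothetical per_gpu_batch helper (not part of fastai) to show one possible guard:

```python
import os

def num_distrib():
    "Return the number of processes in distributed training (if applicable)."
    return int(os.environ.get('WORLD_SIZE', 0))

def per_gpu_batch(total_bs):
    # Hypothetical helper: fall back to a single process when WORLD_SIZE is unset,
    # so the division below never sees a zero.
    n = max(num_distrib(), 1)
    return total_bs // n

os.environ.pop('WORLD_SIZE', None)   # simulate running without a launcher
unset_count = num_distrib()
unset_bs = per_gpu_batch(512)

os.environ['WORLD_SIZE'] = '4'       # what a launcher would normally export
set_bs = per_gpu_batch(512)
print(unset_count, unset_bs, set_bs)
```

Setting WORLD_SIZE by hand, as tried next, removes the zero but does not start the other processes.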

Then I set this inside the main function of the script:

os.environ['WORLD_SIZE'] = '4'

and ran python -m fastai.launch train_cifar.py --gpu=3 again.
This time I get:

Traceback (most recent call last):
  File "train_cifar.py", line 8, in <module>
    def main( gpu:Param("GPU to run on", str)=None ):
  File "/home/turgutluk/fastai/fastai/script.py", line 40, in call_parse
  File "train_cifar.py", line 23, in main
  File "/home/turgutluk/fastai/fastai/vision/data.py", line 108, in from_folder
    if valid_pct is None: src = il.split_by_folder(train=train, valid=valid)
  File "/home/turgutluk/fastai/fastai/data_block.py", line 199, in split_by_folder
    return self.split_by_idxs(self._get_by_folder(train), self._get_by_folder(valid))
  File "/home/turgutluk/fastai/fastai/data_block.py", line 195, in _get_by_folder
    return [i for i in range_of(self) if self.items[i].parts[self.num_parts]==name]
  File "/home/turgutluk/fastai/fastai/data_block.py", line 195, in <listcomp>
    return [i for i in range_of(self) if self.items[i].parts[self.num_parts]==name]
IndexError: index 0 is out of bounds for axis 0 with size 0

But when I set all these environment variables:

os.environ['WORLD_SIZE'] = '4'
os.environ['RANK'] = '3'
os.environ['MASTER_ADDR'] = ''
os.environ['MASTER_PORT'] = '1234'

then it hangs.

python fastai/fastai/launch.py --gpus=3,4,5,6 fastai/examples/train_cifar.py --gpu=3

It works; I was missing the CIFAR data after all. I will use fastai/launch.py to spawn processes for my own script and see if it works.

Another question: looking at watch gpustat, I only see utilization on gpu=3, but I was expecting it to be distributed across gpus=3,4,5,6. Am I missing something? It looks like there are 4 processes running on gpu=3.

[3] GeForce RTX 2080 Ti | 87'C, 99 % | 9084 / 10989 MB | turgutluk(2263M) turgutluk(2271M) turgutluk(2269M) turgutluk(2271M)
[4] GeForce RTX 2080 Ti | 33'C, 0 % | 10 / 10989 MB |
[5] GeForce RTX 2080 Ti | 35'C, 0 % | 10 / 10989 MB |
[6] GeForce RTX 2080 Ti | 29'C, 0 % | 10 / 10989 MB |


It should be like this, since list('3456') is the correct format:

python fastai/fastai/launch.py --gpus=3456 fastai/examples/train_cifar.py
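To see why the comma-free form matters: going by the list('3456') remark, the --gpus string is apparently split character by character. That is an assumption about the launcher drawn from this thread; the snippet only shows what plain Python's list() does with each form.

```python
with_commas = list('3,4,5,6')
without_commas = list('3456')

print(with_commas)     # ['3', ',', '4', ',', '5', ',', '6']
print(without_commas)  # ['3', '4', '5', '6']
```

If the launcher does split this way, the commas would be treated as entries of their own, and two-digit GPU ids could not be expressed in this format.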

When using learn.to_distributed() in a Jupyter notebook, I get the same issue:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

You can’t run distributed training in Jupyter; it needs to be in a script (it needs to launch several copies of the training, one for each GPU, and that’s not possible in Jupyter).
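For readers wondering what "launching several copies" involves, here is a rough sketch using only the standard library. It is not fastai's launcher: the name launch_workers, the 127.0.0.1 address and port 29500 are assumptions for a single machine, and a one-line child program stands in for the training script. Each child gets its own RANK plus the shared WORLD_SIZE, which a single notebook kernel process cannot provide.

```python
import os
import subprocess
import sys

def launch_workers(gpus, child_code):
    """Start one Python process per GPU id and return each child's stdout.

    Hypothetical helper for illustration; `child_code` stands in for a training script.
    """
    env = os.environ.copy()
    env['WORLD_SIZE'] = str(len(gpus))
    env['MASTER_ADDR'] = '127.0.0.1'   # assumed: all workers run on this machine
    env['MASTER_PORT'] = '29500'       # assumed to be a free port
    procs = []
    for rank, gpu in enumerate(gpus):
        child_env = dict(env, RANK=str(rank))   # each copy gets a distinct rank
        cmd = [sys.executable, '-c', child_code, f'--gpu={gpu}']
        procs.append(subprocess.Popen(cmd, env=child_env,
                                      stdout=subprocess.PIPE, text=True))
    # Wait for every child and collect what it printed.
    return [p.communicate()[0].strip() for p in procs]

# Stand-in for a training script: report rank, world size and the GPU argument.
child = "import os, sys; print(os.environ['RANK'], os.environ['WORLD_SIZE'], sys.argv[1])"
outputs = launch_workers(['3', '4', '5', '6'], child)
print(outputs)
```

A real training script would read these variables when it initializes its process group; here the children just echo them back so the per-process differences are visible.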



@sgugger, @sebastienwood et al., I followed your instructions to create a script for distributed training, but as a newbie I am not sure it is actually taking place. I see processes running on the GPUs, but volatile memory is 0%…! I do not know what that means, but I find it weird. How would you interpret these GPU stats?

Note that I added a line to save the trained model (I think it was missing, right?), but the rest is the same as in the docs.

EDIT: I realized that GPU volatile memory does jump to nearly 100% on all 8 GPUs from time to time (in spikes), but most of the time it is 0%, as in the picture.