DistributedDataParallel init hanging

kcturgutlu · March 18, 2019, 4:34am

Yes, it is a single machine with 8 GPUs. That was my initial approach but then I got the following error:

learn = cnn_learner(data=fold_data, base_arch=arch, metrics=[accuracy, auc], 
                    lin_ftrs=[1024,1024], ps=[0.7, 0.7, 0.7],
                    callbacks=learn_callbacks,
                    callback_fns=learn_callback_fns)
learn.to_distributed(cuda_id=0)
learn.fit(1)

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Most of the answers around data parallelism in the forums use nn.DataParallel and couldn’t find a working solution in PyTorch forums as well.

Then regarding to that error message I set the following but it keeps hanging:

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['WORLD_SIZE'] = '4'
os.environ['RANK'] = '0'
torch.distributed.init_process_group(backend='nccl')

This error is not fastai related but there might be someone who faced a similar issue.