Multi-GPU Training with unet_learner and parallel_ctx

I have been able to use 6 GPUs to train an image classification model with cnn_learner by wrapping the model with torch's DataParallel:

learn = cnn_learner(training_images, model, metrics=[F1,accuracy]).to_fp16()
learn.model = nn.DataParallel(learn.model)

However, the same approach does not work for our segmentation model built with unet_learner. After some research I found that it needs DistributedDataParallel instead, as in:

learn = unet_learner(dls, model_type, metrics=[foreground_acc, IoU], self_attention=True, act_cls=Mish, opt_func=opt).to_fp16()
learn.model = nn.parallel.DistributedDataParallel(learn.model)
learn.fit(epochs, lr=lr, cbs=[SaveModelCallback(monitor="foreground_acc", comp=np.greater)])

That caused errors such as:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Then I tried to initialize the distributed process group:

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'	
dist.init_process_group("nccl", rank=0, world_size=6)
learn.model = nn.parallel.DistributedDataParallel(learn.model)
learn.fit(epochs, lr=lr, cbs=[SaveModelCallback(monitor="foreground_acc", comp=np.greater)])

This spawns only one GPU process and seems to work, but it hangs when fit is called.
I also tried removing the callbacks from the fit call, but that didn't help either.
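
My understanding is that init_process_group with world_size=6 waits for six processes to join, while my script only ever starts one, which could explain the hang. For what it's worth, here is a rough sketch of what I believe a manual multi-process launch would look like (train_one_rank is just a placeholder name of mine, and the DataLoaders/learner construction is elided):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train_one_rank(rank, world_size):
    # each spawned process joins the process group with its own rank
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # build the DataLoaders and unet_learner here, wrap learn.model in
    # DistributedDataParallel(learn.model, device_ids=[rank]), then call learn.fit(...)
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_one_rank, args=(world_size,), nprocs=world_size)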

Then I tried a different approach with:

gpu = None
n_gpu = torch.cuda.device_count()
ctx = learn.parallel_ctx if gpu is None and n_gpu else learn.distrib_ctx
with partial(ctx, gpu)():
    print(f"Training in {ctx.__name__} context on GPU {gpu if gpu is not None else list(range(n_gpu))}")
    learn.fit(epochs, lr=lr, cbs=[SaveModelCallback(monitor="foreground_acc", comp=np.greater)])

Finally, multiple GPUs start working as expected!

But it throws an error after all 6 GPUs start working in parallel:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

I’ve looked at other topics but none of them seem to address the case with unet_learner and this specific configuration.

How do I make sure the inputs and weights are available on all GPUs, or wherever they need to be?
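
For reference, this is the kind of check I have in mind, just to see where the weights and one batch of inputs actually live (a quick diagnostic, not a fix):

xb, yb = learn.dls.one_batch()
print(next(learn.model.parameters()).device)  # device of the model weights
print(xb.device)                              # device of the input batch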

Any help is very very much appreciated!!

By the way, I am also trying DistributedDataParallel (DDP).
It does not error, but it hangs before or when fit is called. Also, only one GPU spawns a Python process.

os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '6'
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5'
torch.distributed.init_process_group(backend='nccl')

learn.model = nn.parallel.DistributedDataParallel(learn.model)
learn.fit(epochs, lr=lr, cbs=[SaveModelCallback(monitor="foreground_acc", comp=np.greater)])

Even if I remove the callbacks so the fit call becomes:

learn.fit(epochs, lr=lr)

I have also tried not setting the os.environ parameters in code and instead launching the Python file with

python -m fastai.launch or python -m torch.distributed.launch

both also hang in the same way.

Any reason why DDP hangs and only one GPU spawns?

Have you tried fully following the actual DDP integration through Accelerate that fastai uses? (It seems like you're only using parts of it here.) Specifically, the entire setup detailed here: fastai - Notebook Launcher examples

Or here: fastai - Notebook distributed training
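
The pattern there is roughly the skeleton below. It is not runnable as-is: you would build your own DataLoaders and learner inside the function (so every worker process constructs its own copy), and the exact distrib_ctx arguments are from memory, so double-check them against those docs:

from fastai.vision.all import *
from fastai.distributed import *
from accelerate import notebook_launcher
import torch

def train():
    dls = ...  # build your segmentation DataLoaders here
    learn = unet_learner(dls, model_type).to_fp16()  # same arguments as in your call
    with learn.distrib_ctx(in_notebook=True):
        learn.fit(epochs, lr=lr)

notebook_launcher(train, num_processes=torch.cuda.device_count())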

I’m getting the same error and those notebooks don’t seem to solve the problem for me either. I’m training an LM on a SageMaker Studio instance with 4 GPUs. I’m trying to train with the following code:

gpu = None
if torch.cuda.is_available():
    if gpu is not None: torch.cuda.set_device(gpu)
    n_gpu = torch.cuda.device_count()
else:
    n_gpu = None

if gpu is None and n_gpu is not None:
    ctx = learn.parallel_ctx
    with partial(ctx, gpu)():
        print(f"Training in {ctx.__name__} context on GPU {list(range(n_gpu))}")
        learn.fit_one_cycle(20, 1e-2, cbs=cbs_lm)

I think DataParallel is the right approach here instead of DDP; would that be right?