Multi-GPU Training with unet_learner and parallel_ctx

I have been able to use 6 GPUs to train an image classification model with cnn_learner by wrapping the model with torch's DataParallel:

learn = cnn_learner(training_images, model, metrics=[F1,accuracy]).to_fp16()
learn.model = nn.DataParallel(learn.model)

However, the same approach does not work for our segmentation model built with unet_learner. After some research I found that it needs DistributedDataParallel instead, as in:

learn = unet_learner(dls, model_type, metrics=[foreground_acc, IoU], self_attention=True, act_cls=Mish, opt_func=opt).to_fp16()
learn.model = nn.parallel.DistributedDataParallel(learn.model)
learn.fit(epochs, lr=lr, cbs=[SaveModelCallback(monitor="foreground_acc", comp=np.greater)])

That caused errors such as:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Then I tried to initialize the distributed process group:

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'	
dist.init_process_group("nccl", rank=0, world_size=6)
learn.model = nn.parallel.DistributedDataParallel(learn.model)
learn.fit(epochs, lr=lr, cbs=[SaveModelCallback(monitor="foreground_acc", comp=np.greater)])

This spawns only one GPU process and seems to work, but it hangs when fit is called.
I also tried removing the callbacks from the fit call, but that didn't help either.
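
My understanding is that init_process_group with world_size=6 waits for six processes to join, while my script only ever starts one, which could explain the hang. For what it's worth, here is a rough sketch of what I believe a manual multi-process launch would look like (train_one_rank is just a placeholder name of mine, and the DataLoaders/learner construction is elided):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train_one_rank(rank, world_size):
    # each spawned process joins the process group with its own rank
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # build the DataLoaders and unet_learner here, wrap learn.model in
    # DistributedDataParallel(learn.model, device_ids=[rank]), then call learn.fit(...)
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_one_rank, args=(world_size,), nprocs=world_size)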

Then I tried a different approach with:

gpu = None
n_gpu = torch.cuda.device_count()
ctx = learn.parallel_ctx if gpu is None and n_gpu else learn.distrib_ctx
with partial(ctx, gpu)():
    print(f"Training in {ctx.__name__} context on GPU {gpu if gpu is not None else list(range(n_gpu))}")
    learn.fit(epochs, lr=lr, cbs=[SaveModelCallback(monitor="foreground_acc", comp=np.greater)])

Finally, multiple GPUs start working as expected!

But it throws an error after all 6 GPUs start working in parallel:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

I’ve looked at other topics but none of them seem to address the case with unet_learner and this specific configuration.

How do I make sure the inputs and weights are available on all GPUs, or wherever they need to be?
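
For reference, this is the kind of check I have in mind, just to see where the weights and one batch of inputs actually live (a quick diagnostic, not a fix):

xb, yb = learn.dls.one_batch()
print(next(learn.model.parameters()).device)  # device of the model weights
print(xb.device)                              # device of the input batch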

Any help is very very much appreciated!!

By the way, I am also trying DistributedDataParallel (DDP).
It does not error, but it hangs before or when fit is called. Also, only one GPU spawns a Python process.

os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '6'
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5'
torch.distributed.init_process_group(backend='nccl')

learn.model = nn.parallel.DistributedDataParallel(learn.model)
learn.fit(epochs, lr=lr, cbs=[SaveModelCallback(monitor="foreground_acc", comp=np.greater)])

Even if I remove the callbacks so the fit call becomes:

learn.fit(epochs, lr=lr)

I have also tried not setting the os.environ parameters in code and instead launching the Python file with

python -m fastai.launch or python -m torch.distributed.launch

both also hang in the same way.

Any reason why DDP hangs and only one GPU spawns?

Have you tried fully following the actual DDP integration through Accelerate that fastai uses? (It seems like you're only using parts of it here.) Specifically, the entire setup detailed here: fastai - Notebook Launcher examples

Or here: fastai - Notebook distributed training
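
The pattern there is roughly the skeleton below. It is not runnable as-is: you would build your own DataLoaders and learner inside the function (so every worker process constructs its own copy), and the exact distrib_ctx arguments are from memory, so double-check them against those docs:

from fastai.vision.all import *
from fastai.distributed import *
from accelerate import notebook_launcher
import torch

def train():
    dls = ...  # build your segmentation DataLoaders here
    learn = unet_learner(dls, model_type).to_fp16()  # same arguments as in your call
    with learn.distrib_ctx(in_notebook=True):
        learn.fit(epochs, lr=lr)

notebook_launcher(train, num_processes=torch.cuda.device_count())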

I’m getting the same error and those notebooks don’t seem to solve the problem for me either. I’m training an LM on a SageMaker Studio instance with 4 GPUs. I’m trying to train with the following code:

gpu = None
if torch.cuda.is_available():
    if gpu is not None: torch.cuda.set_device(gpu)
    n_gpu = torch.cuda.device_count()
else:
    n_gpu = None

if gpu is None and n_gpu is not None:
    ctx = learn.parallel_ctx
    with partial(ctx, gpu)():
        print(f"Training in {ctx.__name__} context on GPU {list(range(n_gpu))}")
        learn.fit_one_cycle(20, 1e-2, cbs=cbs_lm)

I think DataParallel is the right approach here instead of DDP; would that be right?