Platform: Amazon SageMaker - AWS

Has anyone had luck using data parallel (DP) or distributed data parallel (DDP) with SageMaker instances or training jobs? I have tried using learner.parallel_ctx and learner.distrib_ctx on an ml.p3.8xlarge training instance (4x V100), but it trains at the same speed as on a p3.2xlarge (1x V100).
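For reference, this is roughly what my training entry point looks like. It's a sketch, not my actual job: the dataset, model, and batch size below are placeholders (the standard PETS example), and I switch between the two context managers rather than running both.

```python
# Rough sketch of the SageMaker entry point (train.py); dataset/model/bs are
# placeholders, not the real job configuration.
from fastai.vision.all import *
from fastai.distributed import *

def main():
    path = untar_data(URLs.PETS)  # placeholder dataset
    dls = ImageDataLoaders.from_name_re(
        path, get_image_files(path/"images"), pat=r"(.+)_\d+.jpg$",
        item_tfms=Resize(224), bs=64)
    learn = cnn_learner(dls, resnet34, metrics=error_rate)

    # Attempt 1: DataParallel -- single process, model replicated across GPUs
    with learn.parallel_ctx():
        learn.fit_one_cycle(5)

    # Attempt 2: DistributedDataParallel -- only kicks in if the script is
    # launched with one process per GPU (it isn't, under SageMaker)
    # with learn.distrib_ctx():
    #     learn.fit_one_cycle(5)

if __name__ == "__main__":
    main()
```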

Based on reading this post, Distributed and parallel training... explained - #6 by pierreguillou, it seems DDP won't work out of the box, since SageMaker launches the training script in a single Python process. It also seems that on the p3.8xlarge the GPUs are only available for data parallel, not distributed, training:

```
rank_distrib() == 0
num_distrib() == 0
torch.cuda.device_count() == 4
```
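Those numbers come from a quick diagnostic like the one below (a sketch). As far as I can tell, fastai's helpers just read the usual torch.distributed environment variables, which SageMaker doesn't set when it starts a single training process:

```python
# Sketch of the diagnostic at the top of the training script. rank_distrib()
# and num_distrib() appear to read RANK / WORLD_SIZE from the environment,
# so they report 0 in a plain single-process SageMaker job.
import os
import torch
from fastai.torch_core import rank_distrib, num_distrib

print("torch.cuda.device_count():", torch.cuda.device_count())
print("rank_distrib():", rank_distrib())
print("num_distrib():", num_distrib())
print("RANK:", os.environ.get("RANK"), "WORLD_SIZE:", os.environ.get("WORLD_SIZE"))
```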

Despite trying both the distributed and parallel context managers, I can't seem to get any speedup from the additional GPUs.
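One workaround I'm considering (untested, and the script names are hypothetical): make the SageMaker entry point a small launcher that re-executes the real training script with one process per GPU, so fastai's distrib_ctx sees WORLD_SIZE > 1. I believe fastai's own launcher (python -m fastai.launch train.py) could be used the same way.

```python
# Untested sketch of a launcher entry point (launcher.py). SageMaker starts
# this single process; it then spawns one worker process per GPU running the
# real training script (train.py, hypothetical name).
import subprocess
import sys
import torch

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    subprocess.run(
        [sys.executable, "-m", "torch.distributed.launch",
         f"--nproc_per_node={n_gpus}",
         "train.py", *sys.argv[1:]],   # forward SageMaker hyperparameter args
        check=True,
    )
```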

Edit: I made a post detailing what I've tried. I haven't had much luck getting anything beyond 1 GPU working with SageMaker.