DataParallel and DistributedDataParallel performance with fastai on AWS SageMaker

I am trying to make use of either distributed or parallel training with fastai on SageMaker notebooks or training jobs (my team is fairly committed to this service). I am running the code on an ml.p3.8xlarge with 4x V100, but I cannot get any speedup from any of the approaches I have tried.

After spinning up the ml.p3.8xlarge notebook instance, here is the setup in my notebook using the pytorch_p36 env:

%%bash
pip install fastai==2.0.0 fastcore==1.0.0
sudo mkdir -p /opt/ml/input/data/collab
sudo chmod 777 /opt/ml/input/data/collab

Here is the code I am testing:

import fastai, fastcore, torch
print(f'fastai {fastai.__version__}')
print(f'fastcore {fastcore.__version__}')
print(f'torch {torch.__version__}')

from fastai.collab import *
from fastai.tabular.all import *
from fastai.distributed import *

path = untar_data(URLs.ML_100k, dest="/opt/ml/input/data/collab")

ratings = pd.read_csv(
    path/'u.data',
    delimiter='\t',
    header=None,
    names=['user','movie','rating','timestamp']
)

movies = pd.read_csv(
    path/'u.item',
    delimiter='|',
    encoding='latin-1',
    usecols=(0,1),
    names=['movie','title'],
    header=None,
)

ratings = ratings.merge(movies)

dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)

n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 64

model = EmbeddingDotBias(n_factors, n_users, n_movies)

learn = Learner(dls, model, loss_func=MSELossFlat())

print(learn.model)

print("rank_distrib():", rank_distrib())
print("num_distrib():", num_distrib())
print("torch.cuda.device_count():", torch.cuda.device_count())

epochs, lr = 5, 5e-3

print('learn.fit_one_cycle')
learn.fit_one_cycle(epochs, lr)

print('with learn.distrib_ctx():')
with learn.distrib_ctx():
    learn.fit_one_cycle(epochs, lr)

print('with learn.distrib_ctx(torch.cuda.device_count()-1):')
with learn.distrib_ctx(torch.cuda.device_count()-1):
    learn.fit_one_cycle(epochs, lr)

print('with learn.parallel_ctx():')
with learn.parallel_ctx():
    learn.fit_one_cycle(epochs, lr)

print('nn.DataParallel(learn.model)')
if torch.cuda.device_count() > 1:
    learn.model = nn.DataParallel(learn.model)
learn.fit_one_cycle(epochs, lr)

Here is the output from running the code as a script:

sh-4.2$ /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python /home/ec2-user/SageMaker/cf.py
fastai 2.0.0
fastcore 0.1.39
torch 1.6.0
EmbeddingDotBias(
  (u_weight): Embedding(944, 64)
  (i_weight): Embedding(1665, 64)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)
rank_distrib(): 0
num_distrib(): 0
torch.cuda.device_count(): 4
learn.fit_one_cycle
epoch     train_loss  valid_loss  time
0         1.153435    1.154428    00:11
1         0.957201    0.954827    00:11
2         0.816548    0.878350    00:11
with learn.distrib_ctx():
epoch     train_loss  valid_loss  time
0         0.999254    1.040871    00:11
1         0.821853    0.914921    00:11
2         0.658059    0.845227    00:11
with learn.distrib_ctx(torch.cuda.device_count()-1):
epoch     train_loss  valid_loss  time
0         0.749317    0.997568    00:11
1         0.580846    0.912386    00:11
2         0.381058    0.878295    00:11
with learn.parallel_ctx():
epoch     train_loss  valid_loss  time
0         0.514148    1.025872    00:25
1         0.383893    0.996381    00:18
2         0.204836    0.970403    00:18
nn.DataParallel(learn.model)
epoch     train_loss  valid_loss  time
0         0.341708    1.103849    00:16
1         0.272570    1.067705    00:16
2         0.134262    1.055507    00:16

Watching GPU usage with nvidia-smi dmon -s u, I can see that only the DataParallel runs (with learn.parallel_ctx(): and nn.DataParallel(learn.model)) show GPU ids 1, 2 and 3 being used. The problem is that DataParallel is actually slower, even when I try increasing the batch size or the embedding size.
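
Since num_distrib() prints 0, I suspect the distrib_ctx() runs above never actually wrap the model in DistributedDataParallel (which would also explain why their per-epoch times match the plain fit_one_cycle run). From my reading of the fastai distributed docs, the script has to be started with a multi-process launcher so that each GPU gets its own process with RANK/WORLD_SIZE set. Below is the kind of script I have been sketching out as a next step; the file name train_collab_ddp.py is just a placeholder and I have not verified this end to end on SageMaker:

# train_collab_ddp.py - same data/model setup as above, intended to run one process per GPU
from fastai.collab import *
from fastai.tabular.all import *
from fastai.distributed import *

path = untar_data(URLs.ML_100k, dest="/opt/ml/input/data/collab")
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=['movie','title'], header=None)
ratings = ratings.merge(movies)

dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
model = EmbeddingDotBias(64, len(dls.classes['user']), len(dls.classes['title']))
learn = Learner(dls, model, loss_func=MSELossFlat())

# When RANK/WORLD_SIZE are set by the launcher, num_distrib() > 1 and
# distrib_ctx() should wrap the model in DistributedDataParallel;
# otherwise (as in my notebook runs above) it appears to be a no-op.
with learn.distrib_ctx():
    learn.fit_one_cycle(5, 5e-3)

launched from a terminal rather than a notebook cell with something like:

/home/ec2-user/anaconda3/envs/pytorch_p36/bin/python -m fastai.launch train_collab_ddp.py

(I believe python -m torch.distributed.launch --nproc_per_node=4 would be the plain-PyTorch equivalent, but I have not confirmed either launcher inside a SageMaker notebook instance.)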

Any help with this would be appreciated. I have a much larger collaborative filtering model that is running into the same issues as this MovieLens example, and I need to reduce its training time, hopefully through parallel/distributed training.