How to use multiple GPUs

I have the same problem as @soorajviraat, i.e. when training language models only one GPU’s memory is fully utilized (with CNNs everything works correctly). I’ve run some benchmarks, and the results are below.
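To make the memory-imbalance observation concrete, per-GPU usage can be printed with plain torch.cuda calls after a few batches (just a diagnostic sketch, nvidia-smi shows the same thing):

import torch

# allocated and peak memory for every visible GPU, in MB
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**2
    peak = torch.cuda.max_memory_allocated(i) / 1024**2
    print(f"cuda:{i}  allocated={allocated:.0f} MB  peak={peak:.0f} MB")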

CNN

Arch, GPU type, dataset:

note: Jeremy suggested that the limiting factor might be the data loaders and to use torchvision to fix it. It’s an old post and I don’t know whether the problem has already been addressed, so I’m planning to check it and add the results to this post.
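A simple check I can think of (my own idea, not something from that post) is to time a pass over the training dataloader with no model work at all; if that alone takes close to an epoch, the loaders are the limiting factor (`learn` being the learner defined at the bottom of this post):

import time

# one pass over the training dataloader, no forward/backward pass
start = time.time()
for xb, yb in learn.data.train_dl:
    pass
print(f"one pass over train_dl: {time.time() - start:.1f} s")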

ResNet34, 4x K80 GPUs, fastai PETS

  • bs 16: single GPU / parallel, 0:58 / 1:20 min per epoch, 4:00 min total; valid loss single / parallel 0.21 / 0.27
  • bs 64: single GPU / parallel, 0:46 / 0:32 min per epoch, 2:30 min total; valid loss single / parallel 0.21 / 0.21
  • bs 256: single GPU / parallel / parallel+fp16, 0:50 / 0:27 / 0:26 min per epoch, 2:00 min total; valid loss single / parallel / parallel+fp16 0.23 / 0.24 / 0.25
  • bs 1024: does not fit on a single GPU; parallel / parallel+fp16 0:40 / 0:39, valid loss 0.359 / 0.36 (the parallel+fp16 setup is sketched after this list)
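The parallel+fp16 runs combine fastai’s mixed precision with DataParallel. A minimal sketch of that setup, assuming `data` is the PETS ImageDataBunch (the ordering, to_fp16 before wrapping, is my best guess rather than a verified recipe):

import torch
from fastai.vision import *

# mixed precision first, then wrap the half-precision model for multi-GPU
learn = cnn_learner(data, models.resnet34, metrics=error_rate).to_fp16()
learn.model = torch.nn.DataParallel(learn.model)
learn.fit_one_cycle(4)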

Language Models

GPU type, dataset:

4x K80 GPUs, fastai IMDB

AWD_LSTM

  • bs 48: single GPU / parallel / fp16 / parallel+fp16, 1:24:00 / 1:23:00 / 1:54:20 / 1:00:00
  • bs 96: single GPU / fp16 / parallel+fp16, n.a. / 1:03:00 / 1:03:00
  • bs 136: parallel, 0:58:00

Transformer

  • bs 36: single GPU / parallel, 3:12:00 / 3:15:00
  • bs 48: single GPU / fp16 / parallel, 3:15:00 / 6:40:00 / 2:40:00
  • bs 96: single GPU / parallel / parallel+fp16, n.a. / 1:40:00 / 2:20:00

TransformerXL

  • bs 96: single GPU / parallel / parallel+fp16, n.a. / 1:13:00 /
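For the language models the wrapping is the same as for the CNN, just applied to a text learner. Roughly like this (a sketch; `data_lm` stands for the IMDB language-model DataBunch, the drop_mult value is illustrative, and the arch can be swapped for Transformer or TransformerXL to match the tables above):

import torch
from fastai.text import *

# same DataParallel wrapping, applied to a language-model learner
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
# learn_lm = learn_lm.to_fp16()   # for the fp16 rows
learn_lm.model = torch.nn.DataParallel(learn_lm.model)
learn_lm.fit_one_cycle(1)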

Classifier

AWD_LSTM

  • bs 100: single GPU / parallel, 09:15 / 5:52
  • bs 136: single GPU / parallel / fp16 / parallel+fp16, total n.a. / 5:17 / 12:39 / 5:17
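The classifier rows use the analogous text_classifier_learner with the same wrapping as the CNN snippet below (again a sketch; `data_clas` stands for the IMDB classification DataBunch and the arguments are illustrative):

import torch
from fastai.text import *

# same wrapping for the classification benchmarks
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.model = torch.nn.DataParallel(learn_clas.model)
learn_clas.fit_one_cycle(1)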

The code I used for parallelization is the following:

import torch
from fastai.vision import *  # provides cnn_learner, models, error_rate

learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.model = torch.nn.DataParallel(learn.model)  # split each batch across GPUs; outputs are gathered on GPU 0

I’m not sure what the cause is. Maybe it is working correctly (we do see a speed-up for language models, even though memory limits the batch size and therefore the parallelization speed-up). It would be great if someone more experienced could help interpret the results we have here.