Multiple GPUs: How to get gains in training speed

There seems to be some confusion in various forum posts about the potential gains from using to_distributed to speed up model training. I ran an experiment whose results can hopefully shed some light on the matter and help guide where it makes sense to use more than one GPU.

Starting from the documentation on launching a distributed model, I set up a small Python script and a small bash script to launch the runs. Scripts and results are all posted here for others to replicate as they wish (and to replicate on machines with more than 2 GPUs, to confirm whether you get a further speed-up in those cases).

        Batches/Epoch          Epoch Time (mm:ss)
 bs     1 GPU     2 GPUs       1 GPU     2 GPUs     Gain
   4    12,500     6,250       15:17      8:16      1.85
   8     6,250     3,125        8:02      4:15      1.89
  16     3,125     1,562        3:48      2:06      1.81
  32     1,562       781        1:59      1:07      1.78
  64       781       390        1:04      0:38      1.68
 128       390       195        0:35      0:23      1.52
 256       195        97        0:22      0:16      1.38
 512        97        48        0:21      0:17      1.24

From this data, I believe you get the biggest gains when you have small batches. As the batch size grows (64 and up in this case), the gain from the extra hardware shrinks. I believe this is because a larger batch size means fewer batches per epoch, so there is less and less to gain from running in parallel on two separate GPU cards.
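
For reference, the batches-per-epoch numbers in the table fall straight out of CIFAR-10's 50,000 training images; a quick sanity check (not part of the timing script):

# Quick check that the "Batches/Epoch" columns above are just 50,000
# CIFAR-10 training images divided by the per-process batch count.
n_train = 50_000

for bs in [4, 8, 16, 32, 64, 128, 256, 512]:
    one_gpu = n_train // bs          # e.g. bs=4  -> 12,500 batches per epoch
    two_gpu = n_train // (bs * 2)    # each of the 2 processes sees half of them
    print(f"bs={bs:4d}  1 GPU: {one_gpu:6,d}  2 GPUs: {two_gpu:6,d}")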

So, I think that if you have a problem where you cannot fit much on a single GPU (either your model is large, your data is large, or both) and are therefore forced into small batches, then going to more than one GPU should help you reduce the time per epoch.

I am very curious if others have similar results/experience with this kind of test.

from fastai.vision import *
from fastai.callbacks.mem import PeakMemMetric
from fastai.distributed import *
import argparse

# torch.distributed.launch passes --local_rank to each process it spawns
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
parser.add_argument("--bs", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

batch_size = args.bs

path = untar_data(URLs.CIFAR)
ds_tfms = ([*rand_pad(4, 32), flip_lr(p=0.5)], [])  # train transforms, none for valid
data = ImageDataBunch.from_folder(path, valid='test', ds_tfms=ds_tfms, bs=batch_size).normalize(cifar_stats)

print(f'bs: {batch_size}, num_batches: {len(data.train_dl)}')
learn = cnn_learner(data, models.resnet50, metrics=accuracy, callback_fns=PeakMemMetric).to_distributed(args.local_rank)
learn.unfreeze()  # train the full resnet50, not just the head
learn.fit_one_cycle(1, 3e-3, wd=0.5, div_factor=10, pct_start=0.5)

Bash script for experiments:

#!/bin/bash
for bs in 4 8 16 32 64 128 256 512
do
    for gpus in 1 2
    do
        python -m torch.distributed.launch --nproc_per_node=$gpus time_distributed.py --bs=$bs
    done
done

I used

learn.model = nn.DataParallel(learn.model)

and got a speed-up of around 6x on a classification example (resnet50) with 2 GPUs. Contrary to your observation, I used a big batch size of 512, but on a huge dataset with 100k images.

Would be really happy if this could be applied to segmentation as well, but at the moment that does not work, unfortunately.

Is that on a public dataset? Can you share that code so I can test out locally?

It is a research data set of our group that I am not allowed to share, at least at the moment, sorry.

Regarding the code, you just have to add the line

learn.model = nn.DataParallel(learn.model)

assuming your Learner is called learn. Nothing else needs to be done; you could, e.g., use the pets classification example from the course.
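
For reference, here is a minimal sketch of what that looks like on the pets data; the architecture, image size, batch size, and schedule below are illustrative placeholders rather than the exact course-notebook values:

from fastai.vision import *

# Minimal nn.DataParallel sketch on the pets dataset (hyperparameters are
# illustrative, not the exact course-notebook values).
path = untar_data(URLs.PETS)
fnames = get_image_files(path/'images')
pat = r'/([^/]+)_\d+.jpg$'
data = ImageDataBunch.from_name_re(path/'images', fnames, pat, ds_tfms=get_transforms(),
                                   size=224, bs=64).normalize(imagenet_stats)

learn = cnn_learner(data, models.resnet50, metrics=accuracy)
learn.model = nn.DataParallel(learn.model)  # the only multi-GPU-specific line
learn.fit_one_cycle(4, 3e-3)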

If you need a big dataset, just take Imagenet or the small version from


to_parallel wraps the model in nn.DataParallel and does the same as you describe (splitting each batch across the cards). to_distributed sets up and runs nn.parallel.DistributedDataParallel, which is a bit different: the model is duplicated in one process per card, each process loads its own batches, so each process sees fewer batches per epoch.
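
To make the contrast concrete, here is a rough side-by-side sketch (illustrative only; the distributed branch assumes the script is launched with python -m torch.distributed.launch --nproc_per_node=2 as in the bash runner above):

from fastai.vision import *
from fastai.distributed import *
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
parser.add_argument("--mode", choices=["parallel", "distributed"], default="parallel")
parser.add_argument("--bs", type=int, default=128)
args = parser.parse_args()

path = untar_data(URLs.CIFAR)
data = ImageDataBunch.from_folder(path, valid='test', bs=args.bs).normalize(cifar_stats)
learn = cnn_learner(data, models.resnet50, metrics=accuracy)

if args.mode == "parallel":
    # Single process: each batch of size bs is split across all visible GPUs.
    # (learn.to_parallel() wraps the model in nn.DataParallel the same way.)
    learn.model = nn.DataParallel(learn.model)
else:
    # One process per GPU: every process loads its own batches of size bs, so
    # the epoch's batches are divided between processes and gradients are
    # synchronized by nn.parallel.DistributedDataParallel under the hood.
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    learn = learn.to_distributed(args.local_rank)

learn.fit_one_cycle(1, 3e-3)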

I cannot get close to your 6x speed-up. In fact, I see slowdowns in some cases (small batches, for example). I am curious what is special about your data, arch, or testing method that achieves this. What can you share about that?

Here is what I have tried:

  • Imagenette duplicated 5x (to get ~83k input images). I don’t care how unique the images are; I just want enough to get close to your 100k-image profile.
  • I have 12 GB per GPU card (Titan Xp)
  • 64-pixel image size (larger images won’t fit with large batch sizes on my cards…)
  • resnet50
  • apply learn.model = nn.DataParallel(learn.model) when using more than one GPU; otherwise do nothing special.

I feel like there is something simple I am missing. Code posted below. Is it similar to yours?

Bash runner:

#!/bin/bash
for bs in 512 256
do
    for gpus in 1 2
    do
        python3 time_dist_DataParallel.py --bs=$bs --ngpus=$gpus
    done
done

Python script:

from fastai.vision import *
from fastai.callbacks.mem import PeakMemMetric
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--bs", type=int)
parser.add_argument("--ngpus", type=int)
args = parser.parse_args()
batch_size = args.bs
path = untar_data(URLs.IMAGENETTE)  # this has a bunch of extra copies to get ~83k images
workers = 16
size = 64
data = ImageList.from_folder(path).split_by_folder(valid='val')\
    .label_from_folder().transform(([flip_lr(p=0.5)], []), size=size)\
    .databunch(bs=batch_size, num_workers=workers)\
    .presize(size, scale=(0.35, 1))\
    .normalize(imagenet_stats)

print(f'bs: {batch_size}, num_batches: {len(data.train_dl)}')
learn = cnn_learner(data, models.resnet50, metrics=accuracy, callback_fns=PeakMemMetric)
if args.ngpus != 1:
    learn.model = nn.DataParallel(learn.model)  # single-process multi-GPU: split each batch across cards

learn.fit_one_cycle(1, 3e-3, wd=0.5, div_factor=10, pct_start=0.5)

Output:

bs: 512, num_batches:162
epoch     train_loss  valid_loss  accuracy  cpu used  peak      gpu used  peak      time    
0         3.793614    0.962922    0.800000  0         3         156       8670      01:45      
Total time: 01:45
bs: 512, num_batches:162
epoch     train_loss  valid_loss  accuracy  cpu used  peak      gpu used  peak      time    
0         3.806999    1.024818    0.784000  0         3         266       5262      01:45      
Total time: 01:45
bs: 256, num_batches:325
epoch     train_loss  valid_loss  accuracy  cpu used  peak      gpu used  peak      time    
0         3.709135    0.964027    0.802000  0         3         90        5414      01:40      
Total time: 01:40
bs: 256, num_batches:325
epoch     train_loss  valid_loss  accuracy  cpu used  peak      gpu used  peak      time    
0         3.724478    0.973158    0.806000  0         3         190       11084     01:46      
Total time: 01:46

What differs in my setup is:

  • I have 32 GB per GPU (Tesla V100)
  • My images are 128 x 128
  • I additionally used mixed precision training (learn.to_fp16())

I didn’t use a bash runner or any printing; I just ran everything inside a Jupyter notebook, and the timings are taken from the ShowGraph callback. On a single GPU one epoch took 6:12 on average, on two GPUs 0:59.

I also tested with a dataset of 1,000 images; there I saw nearly no impact. A single GPU was at 0:04 and two GPUs at 0:03 on average.

Hope that helps.

@ptrampert Thank you for that detail. It really helped me understand more about what is going on here. I am still not able to replicate your 6x speed-up, but I am glad it is working for you! I suspect it has to do with other system differences, including memory access speed and the number and speed of cores on the V100 card. I just don’t know enough to track that down, but I found some big differences when I looked up the V100 spec sheet.

Back to the tests. First, I had an error in my prior code: I did not call learn.unfreeze(), so the set of trainable parameters was quite limited (just the head, roughly 2M parameters rather than 25M for the full resnet50). This is why my times were the same with one and two GPUs.
Second, switching to to_fp16() allowed me to run a bigger batch size on the same card. That was very helpful.
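
For reference, the tail of the training script now looks roughly like this (a sketch; data and args are set up exactly as in the first script above):

# Corrected tail of time_distributed.py (data/args built as before).
learn = cnn_learner(data, models.resnet50, metrics=accuracy,
                    callback_fns=PeakMemMetric).to_distributed(args.local_rank)
learn = learn.to_fp16()   # mixed precision: lets a larger batch fit on the same card
learn.unfreeze()          # otherwise only the ~2M-parameter head is trainable
learn.fit_one_cycle(1, 3e-3, wd=0.5, div_factor=10, pct_start=0.5)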

I ran some more tests with resnet50, then switched to resnet101 to get more trainable params (about 42M) and see the impact of splitting across GPUs. Here is what I found:

With nn.DataParallel:
(image: results table)

With to_distributed():
(image: results table)

And this fits what I researched about these two approaches:

  1. nn.DataParallel splits each batch across the n_gpus cards, replicates the model on each, and collects the gradients before taking a step. So with a batch size of 512, that large batch gets split, and you can run without CUDA OOM errors (since we found that 256 fits on a single card).
  2. to_distributed() runs a full copy of the model on each GPU with the same per-process batch size (with a batch size of 256 you get two such copies running in parallel), and you get a slight gain with this method over nn.DataParallel.

One takeaway is that if you want to run a bigger batch, nn.DataParallel will let you spread that bigger batch across two GPUs. From my tests so far, it appears you would do even better running to_distributed() with half the batch size, but my tests have been limited.

The other detail I am realizing is that “your mileage may vary” depending on the number of trainable params, the size of the dataset, etc. So, in short, it seems sensible to test out a few approaches and see what works for your data/arch.

I am curious if others have found different optima for other problems.


Hello Patrick, do you know why DataParallel does not work for a segmentation task? I need to apply data parallelism to a segmentation task and was planning to try it. Thanks.

Hi. Unfortunately, I could not get it to work either. There seems to be an issue inside PyTorch causing this. It might be possible to do distributed training with a different parallelization paradigm (nn.parallel.distributed), but I have not tried that yet.


I am still in the early stages of exploring multi-GPU vs. single-GPU training. Is there a difference between running in Jupyter vs. running the script straight with python?

My massive data subset from the Kaggle Google competition has around 3 million images. Set up with either 1 GPU or 2 GPUs in a Jupyter notebook, training seems to take the same time (~2 hours). Looking at nvidia-smi, it appears the work is loading properly onto both GPUs from Jupyter.

Code:

When running the CIFAR script, I do notice a speed-up from 1 to 2 GPUs: from 45 seconds to 26 seconds.

Guesses for causes include:

  1. Image resizing (fixing that now)
  2. Slower Processor on machine
  3. Faulty estimates from a large dataset

I believe this has to be run outside of Jupyter to work properly. Dump your notebook out as a script and see if you don’t get a speed-up.


Edit: Fixed it in Jupyter by pulling everything into the data block below and adding workers. I got the speed increase.

data = ImageList.from_csv('/home/jd/data/google/256/', csv_name='5splitFor45594Split1.csv',
                          folder='train', suffix='.jpg')\
    .split_from_df(col='is_valid')\
    .label_from_df(cols='landmark_id')\
    .add_test_folder('test')\
    .transform(tfms, size=imgsize)\
    .databunch(bs=bs, num_workers=workers)\
    .normalize(imagenet_stats)

Old message below

Giving it a try, stuck on this error. Will try it on a different dataset soon.

RuntimeError: cuda runtime error (59) : device-side assert triggered

This appears to be related to labels containing a -1, and it makes me wonder if building src first and then putting it into data is causing confusion.

from fastai.script import *
from fastai.vision import *
from fastai.vision.models.wrn import wrn_22
from fastai.distributed import *
torch.backends.cudnn.benchmark = True

@call_parse
def main( gpu:Param("GPU to run on", str)=None ):
    """Distrubuted training similar to  CIFAR-10."""
    gpu = setup_distrib(gpu)
    n_gpus = 2
    path= "/home/jd/data/google/256/"
    tfms = ([*rand_pad(4, 32), flip_lr(p=0.5)], [])
    workers = min(16, num_cpus()//n_gpus)

    src = (ImageList.from_csv('/home/jd/data/google/256/', csv_name='Small5splitFor45594Split1.csv')
        .split_from_df(col='is_valid')
        .label_from_df(cols='landmark_id'))

    data = (src.transform(tfms,
                      resize_method=ResizeMethod.SQUISH,
                      size=256)
       .databunch(bs=256//n_gpus,num_workers=workers)
       .normalize(imagenet_stats))

    learn = Learner(data, wrn_22(), metrics=accuracy)
    learn.to_distributed(gpu)
    learn.to_fp16()
    learn.fit_one_cycle(35, 3e-3, wd=0.4)

Do any of you happen to know if, when we use to_distributed, fastai replaces the BatchNorm layers with SyncBatchNorm?

https://pytorch.org/docs/master/generated/torch.nn.SyncBatchNorm.html#torch.nn.SyncBatchNorm

I’m trying to get multiple GPUs to work together to counterbalance huge data that makes my batch size very small… but if each GPU ends up with the same batch size as before, then having two GPUs does nothing more than gradient accumulation. It might be faster, but it won’t give me any batch-size-related boost in performance…

Any thoughts? Do you know how I could replace the BatchNorm layers with SyncBatchNorm?

Edit: I am no expert at navigating the GitHub repo, but from what I understand, distributed.py was last modified in March last year:

and SyncBatchNorm was added to PyTorch one month later:

So, probably not. What would be the best way to add this feature?

Just add net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net) to the DistributedTrainer callback?
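
For anyone who wants to experiment before anything lands in the library, here is a minimal, untested sketch of that idea applied outside the callback (the dataset, batch size, and schedule are placeholders):

import argparse
from fastai.vision import *
from fastai.distributed import *

# Hypothetical sketch, not current fastai v1 behavior: convert BatchNorm layers
# to SyncBatchNorm so batch statistics are computed across all processes rather
# than per-GPU mini-batch. Launch with torch.distributed.launch as usual.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

path = untar_data(URLs.CIFAR)  # placeholder dataset
data = ImageDataBunch.from_folder(path, valid='test', bs=32).normalize(cifar_stats)
learn = cnn_learner(data, models.resnet50, metrics=accuracy)

# Swap every nn.BatchNorm*d for nn.SyncBatchNorm before the DDP wrapping
# that to_distributed() performs.
learn.model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(learn.model)
learn = learn.to_distributed(args.local_rank)
learn.fit_one_cycle(1, 3e-3)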

@sgugger sorry for bothering you with v1, but from what I gather of the code in v2, I believe this SyncBatchNorm question might be of interest to you, since it doesn’t seem to be implemented in v2.