Confusion about running out of memory on GPU (allocated memory increases non-linearly with batch size)

I’m working with a dataset of 10k training images. With an image size of 512px, it seems that the amount of memory required is 1GB * batch_size (i.e. a batch size of 8 requires 8GB of memory) – at least according to the CUDA errors I am getting.

However, when I try to use additional GPUs or a GPU with more RAM (4x K80s or P4000 -> P6000), I’m not able to increase my batch size as much as I want because the allocated memory increases. What is the allocated memory? Is there a way to reduce it?

It just doesn’t make sense that the largest batch size I can use on 4 GPUs (K80, so 48GB total of RAM) is 8. Or on Paperspace’s P6000, the allocated memory increases to something crazy like 21GB and I can’t even fit a batch size of 4 (when I should be able to fit at least 16).

Any help would be very appreciated as this has been an extremely frustrating experience :pensive:

A lot of distributed training setups take the batch size argument and use that on each GPU, so you may actually have a global batch size of 4 * 8.
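
For example, a rough sketch of how that usually looks in PyTorch with a DistributedSampler (the toy dataset and the hard-coded world_size/rank are just placeholders, not your actual setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# Toy stand-in dataset (placeholder, not your real data).
dataset = TensorDataset(torch.randn(64, 3, 512, 512))

world_size = 4   # e.g. 4x K80; normally taken from the launcher/environment
rank = 0         # this process's index; also normally set by the launcher

# Each process builds its own DataLoader, so batch_size here is *per GPU*:
# the effective global batch size is batch_size * world_size, e.g. 8 * 4 = 32.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)
```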

GPU memory is used to store both the images in the current batch as well as the model parameters. The model parameters are often a significant proportion of the memory used. Because of this you can’t think about memory required in terms of memory per image * batch size.
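
As a rough illustration (using a torchvision ResNet-34 as a stand-in, not your actual model), you can see how much the parameters alone take:

```python
import torch
from torchvision.models import resnet34

# Stand-in model (placeholder); the same calculation works for any nn.Module.
model = resnet34()

# Bytes needed just to hold the weights.
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameters: {param_bytes / 1024**2:.1f} MiB")
```

Gradients roughly double that during training, and optimizers like Adam keep additional state per parameter on top, before you even count the activations for a batch of 512px images.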

There are two broad approaches to parallelizing training, model-parallel and data-parallel. In both cases each GPU needs to store all the model weights as well as the images in the current batch. As @rwightman suggests, when you parallelize, your effective batch size is actually the batch size * the number of GPUs. The thing you have to watch out for, though, is that when your per-GPU batch size gets small the batchnorm statistics can get screwy. PyTorch provides functionality to synchronize batchnorm statistics across different devices.
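
A minimal sketch of that conversion (on a tiny placeholder model; it only has an effect once the model is wrapped in DistributedDataParallel with torch.distributed initialized):

```python
import torch.nn as nn

# Any model containing BatchNorm layers; a tiny stand-in here (placeholder).
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Replace every BatchNorm layer with SyncBatchNorm so the statistics are
# computed across all participating GPUs instead of per device.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```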

This is a really good thread to read when it comes to GPU memory usage: Understanding GPU memory usage

I don’t think that’s the case, because when I set the batch size to something like 8 or 16, the CUDA error shows that I need 2 GB or 4 GB respectively (if I’m using 4 GPUs).

I wouldn’t use the allocation failure messages to try and figure out exactly what you need; they only show the allocation that failed, and you don’t get a clear picture of the full sequence of allocations.

I don’t know what sort of model you’re running, and you haven’t shared the training code/setup. But when you’re working with larger image sizes and big networks, batch sizes this small aren’t unusual. I’m currently running an object detection training session on 48GB of GPU memory, in FP16, and I get 6 images per GPU (two GPUs) with a ResNeXt-50 backbone… resolution scaling the long edge between 800-1000 pixels.
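
If you want to try FP16 yourself, a minimal sketch with PyTorch’s automatic mixed precision (the tiny model and random data below are placeholders, and this isn’t necessarily the exact setup I’m running):

```python
import torch
import torch.nn as nn

device = "cuda"

# Tiny stand-in model, optimizer and data (placeholders, not the real setup).
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten(),
                      nn.Linear(16 * 62 * 62, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

images = torch.randn(8, 3, 64, 64, device=device)
targets = torch.randint(0, 10, (8,), device=device)

optimizer.zero_grad()
# Forward pass in mixed precision: most activations are stored in FP16,
# which roughly halves their memory footprint compared to FP32.
with torch.cuda.amp.autocast():
    loss = nn.functional.cross_entropy(model(images), targets)
scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()
```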

Thanks I’ll check out that thread :slight_smile:

GPU memory is used to store both the images in the current batch as well as the model parameters. The model parameters are often a significant proportion of the memory used. Because of this you can’t think about memory required in terms of memory per image * batch size.

I get that, but wouldn’t the model params be constant? Even when I am using a single GPU and a constant batch size, the allocated memory changes (if anything it seems like it uses a percentage of available memory).

For instance, if I use a P4000 I can’t use a batch size >4 and the allocated memory is around 7-8 GB (for a bs of 4). If I up that to a P5000, the allocation increases to roughly 10-12GB, and if I move up to a P6000, the allocation increases to 21GB (and, crazily enough, I can no longer fit a batch size of 4). The only things changing are the batch size and the GPU type/size, yet the allocation keeps increasing so that I can’t fit more than 4 images in a batch.

I wouldn’t use the allocation failure messages to try and figure out exactly what you need; they only show the allocation that failed, and you don’t get a clear picture of the full sequence of allocations.

Could you explain this a little more? Not sure if I understand what you mean.

I don’t know what sort of model you’re running, and you haven’t shared the training code/setup.

I’m using XResNet 34 and can easily train large batches at 128px and 256px, but I can’t get anything above 8 images per batch (and that is on 4x K80s w/ 48 GB total) when using 512x512px. So unless XResNet is pretty inefficient when it comes to GPU memory usage, both my network and my images are smaller than yours, yet I can only fit a batch size that is 75% of yours.

XResNet34 is not horribly inefficient. It’s one of the better models for GPU memory usage.

The dominant (changing) factor in the GPU memory usage for fully conv networks is going to be the WxHxB of your input. That means every doubling of resolution is a 4x increase. To go from 256x256 to 512x512 and maintain similar memory usage you need to reduce the batch dim to 1/4 of what it was.
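
To make that concrete, here is the arithmetic (assuming activation memory scales with W * H * B, as above):

```python
# Activation memory scales roughly with W * H * B for a fully conv network.
def relative_activation_memory(width, height, batch):
    return width * height * batch

base = relative_activation_memory(256, 256, 16)          # batch 16 at 256x256
big = relative_activation_memory(512, 512, 16)           # same batch at 512x512
print(big / base)                                         # 4.0 -> 4x the memory

# To keep memory roughly constant, drop the batch to a quarter: 16 -> 4.
print(relative_activation_memory(512, 512, 4) / base)     # 1.0
```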

Allocators are complex; it’s not a single massive block that depends on your input size and your network. As the data moves through the ops, memory needs to be allocated for the result of each op (or not, if it’s in-place, but most ops aren’t). This is true for the forward pass and the backward pass, so lots of blocks of different sizes need to be allocated. Some cuDNN ops like convolutions use different amounts of memory depending on the kernel size and characteristics of the op – they will choose a different algorithm under the hood that may need to reorganize the data for faster processing but incur more memory usage in the process. PyTorch also has a caching layer that tries to reuse blocks of compatible sizes (if it can) and avoid going down to the CUDA allocator.

The first one of these allocations to fail will throw the error that you see; it could be a 64 MiB allocation when you’ve got 28 MiB free, or a 4 GiB allocation when you’ve got 3 GiB free. Inferring what your total overrun is from any given error message isn’t possible. The easiest way to find the limit is to start small enough that it works and keep increasing WxH or B until it fails.
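
A sketch of that approach, together with PyTorch’s built-in memory counters (the small conv stack and the 512px inputs are placeholders, not your training setup):

```python
import torch
import torch.nn as nn

device = "cuda"
# Stand-in network (placeholder, not the actual model).
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 64, 3, padding=1)).to(device)

for batch_size in (2, 4, 8, 16, 32):
    try:
        x = torch.randn(batch_size, 3, 512, 512, device=device)
        model(x).sum().backward()
        # What PyTorch has handed out vs. what it is holding in its cache.
        print(f"bs={batch_size}: "
              f"allocated={torch.cuda.memory_allocated() / 1024**2:.0f} MiB, "
              f"reserved={torch.cuda.memory_reserved() / 1024**2:.0f} MiB")
    except RuntimeError as e:        # CUDA OOM surfaces as a RuntimeError
        print(f"bs={batch_size} failed: {e}")
        break
    finally:
        torch.cuda.empty_cache()     # release cached blocks between attempts
```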