Confusion about running out of memory on GPU (allocated memory increases non-linearly with batch size)

I’m working with a dataset of 10k training images. With an image size of 512px, it seems that the amount of memory required is 1GB * batch_size (i.e. a batch size of 8 requires 8GB of memory) – at least according to the CUDA errors I am getting.

However, when I try to use additional GPUs or a GPU with more RAM (4x K80s or P4000 -> P6000), I’m not able to increase my batch size as much as I want because the allocated memory increases. What is the allocated memory? Is there a way to reduce it?

It just doesn’t make sense that the largest batch size I can use on 4 GPUs (K80, so 48GB total of RAM) is 8. Or on Paperspace’s P6000, the allocated memory increases to something crazy like 21GB and I can’t even fit a batch size of 4 (when I should be able to fit at least 16).

Any help would be very appreciated as this has been an extremely frustrating experience :pensive:

A lot of distributed training setups take the batch size argument and use that on each GPU, so you may actually have a global batch size of 4 * 8.
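
For example, a rough sketch of how that usually looks in PyTorch with a DistributedSampler (the toy dataset and the hard-coded world_size/rank are just placeholders, not your actual setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# Toy stand-in dataset (placeholder, not your real data).
dataset = TensorDataset(torch.randn(64, 3, 512, 512))

world_size = 4   # e.g. 4x K80; normally taken from the launcher/environment
rank = 0         # this process's index; also normally set by the launcher

# Each process builds its own DataLoader, so batch_size here is *per GPU*:
# the effective global batch size is batch_size * world_size, e.g. 8 * 4 = 32.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)
```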

GPU memory is used to store both the images in the current batch as well as the model parameters. The model parameters are often a significant proportion of the memory used. Because of this you can’t think about memory required in terms of memory per image * batch size.
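
As a rough illustration (using a torchvision ResNet-34 as a stand-in, not your actual model), you can see how much the parameters alone take:

```python
import torch
from torchvision.models import resnet34

# Stand-in model (placeholder); the same calculation works for any nn.Module.
model = resnet34()

# Bytes needed just to hold the weights.
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameters: {param_bytes / 1024**2:.1f} MiB")
```

Gradients roughly double that during training, and optimizers like Adam keep additional state per parameter on top, before you even count the activations for a batch of 512px images.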

There are two broad approaches to parallelizing training, model-parallel and data-parallel. In both cases each GPU needs to store all the model weights as well as the images in the current batch. As @rwightman suggests, when you parallelize, your effective batch size is actually the batch size * the number of GPUs. The thing you have to watch out for, though, is that when your per-GPU batch size gets small the batchnorm statistics can get screwy. PyTorch provides functionality to synchronize batchnorm statistics across different devices.
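
A minimal sketch of that conversion (on a tiny placeholder model; it only has an effect once the model is wrapped in DistributedDataParallel with torch.distributed initialized):

```python
import torch.nn as nn

# Any model containing BatchNorm layers; a tiny stand-in here (placeholder).
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Replace every BatchNorm layer with SyncBatchNorm so the statistics are
# computed across all participating GPUs instead of per device.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```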

This is a really good thread to read when it comes to GPU memory usage: Understanding GPU memory usage

I don’t think that’s the case, because when I set the batch size to something like 8 or 16, the CUDA error shows that I need 2 GB or 4 GB respectively (if I’m using 4 GPUs).

I wouldn’t use the allocation failure messages to try and figure out exactly what you need; they only show the allocation that failed, and you don’t get a clear picture of the full sequence of allocations.

I don’t know what sort of model you’re running, and you haven’t shared the training code/setup. But when you’re working with larger image sizes and big networks, batch sizes this small aren’t unusual. I’m currently running an object detection training session on 48GB of GPU memory, in FP16, and I get 6 images per GPU (two GPUs) with a ResNeXt-50 backbone… resolution scaling the long edge between 800-1000 pixels.
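
If you want to try FP16 yourself, a minimal sketch with PyTorch’s automatic mixed precision (the tiny model and random data below are placeholders, and this isn’t necessarily the exact setup I’m running):

```python
import torch
import torch.nn as nn

device = "cuda"

# Tiny stand-in model, optimizer and data (placeholders, not the real setup).
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten(),
                      nn.Linear(16 * 62 * 62, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

images = torch.randn(8, 3, 64, 64, device=device)
targets = torch.randint(0, 10, (8,), device=device)

optimizer.zero_grad()
# Forward pass in mixed precision: most activations are stored in FP16,
# which roughly halves their memory footprint compared to FP32.
with torch.cuda.amp.autocast():
    loss = nn.functional.cross_entropy(model(images), targets)
scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()
```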

Thanks I’ll check out that thread :slight_smile:

GPU memory is used to store both the images in the current batch as well as the model parameters. The model parameters are often a significant proportion of the memory used. Because of this you can’t think about memory required in terms of memory per image * batch size.

I get that, but wouldn’t the model params be constant? Even when I am using a single GPU and a constant batch size, the allocated memory changes (if anything it seems like it uses a percentage of available memory).

For instance, if I use a P4000 I can’t use a batch size >4 and the allocated memory is around 7-8 GB (for a bs of 4). If I up that to a P5000, the allocation increases to roughly 10-12GB, and if I move up to a P6000, the allocation increases to 21GB (and, crazily enough, I can no longer fit a batch size of 4). The only things changing are the batch size and the GPU type/size, yet the allocation keeps increasing so that I can’t fit more than 4 images in a batch.

I wouldn’t use the allocation failure messages to try and figure out exactly what you need; they only show the allocation that failed, and you don’t get a clear picture of the full sequence of allocations.

Could you explain this a little more? Not sure if I understand what you mean.

I don’t know what sort of model you’re running, and you haven’t shared the training code/setup.

I’m using XResNet 34 and can easily train large batches at 128px and 256px, but I can’t get anything above 8 images per batch (and that is on 4x K80s w/ 48 GB total) when using 512x512px. So unless XResNet is pretty inefficient when it comes to GPU memory usage, both my network and my images are smaller than yours, yet I can only fit a batch size that is 75% of yours.

XResNet34 is not horribly inefficient. It’s one of the better models for GPU memory usage.

The dominant (changing) factor in the GPU memory usage for fully conv networks is going to be the WxHxB of your input. That means every doubling of resolution is a 4x increase. To go from 256x256 to 512x512 and maintain similar memory usage you need to reduce the batch dim to 1/4 of what it was.
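
To make that concrete, here is the arithmetic (assuming activation memory scales with W * H * B, as above):

```python
# Activation memory scales roughly with W * H * B for a fully conv network.
def relative_activation_memory(width, height, batch):
    return width * height * batch

base = relative_activation_memory(256, 256, 16)          # batch 16 at 256x256
big = relative_activation_memory(512, 512, 16)           # same batch at 512x512
print(big / base)                                         # 4.0 -> 4x the memory

# To keep memory roughly constant, drop the batch to a quarter: 16 -> 4.
print(relative_activation_memory(512, 512, 4) / base)     # 1.0
```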

Allocators are complex; it’s not a single massive block that depends on your input size and your network. As the data moves through the ops, memory needs to be allocated for the result of each op (or not, if it’s in-place, but most ops aren’t). This is true for the forward pass and the backward pass, so lots of blocks of different sizes need to be allocated. Some cuDNN ops like convolutions use different amounts of memory depending on the kernel size and characteristics of the op – they will choose a different algorithm under the hood that may need to reorganize the data for faster processing but incur more memory usage in the process. PyTorch also has a caching layer that tries to reuse blocks of compatible sizes (if it can) and avoid going down to the CUDA allocator.

The first one of these allocations to fail will throw the error that you see; it could be a 64 MiB allocation when you’ve got 28 MiB free, or a 4 GiB allocation when you’ve got 3 GiB free. Inferring what your total overrun is from any given error message isn’t possible. The easiest way to find the limit is to start small enough that it works and keep increasing WxH or B until it fails.
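
A sketch of that approach, together with PyTorch’s built-in memory counters (the small conv stack and the 512px inputs are placeholders, not your training setup):

```python
import torch
import torch.nn as nn

device = "cuda"
# Stand-in network (placeholder, not the actual model).
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 64, 3, padding=1)).to(device)

for batch_size in (2, 4, 8, 16, 32):
    try:
        x = torch.randn(batch_size, 3, 512, 512, device=device)
        model(x).sum().backward()
        # What PyTorch has handed out vs. what it is holding in its cache.
        print(f"bs={batch_size}: "
              f"allocated={torch.cuda.memory_allocated() / 1024**2:.0f} MiB, "
              f"reserved={torch.cuda.memory_reserved() / 1024**2:.0f} MiB")
    except RuntimeError as e:        # CUDA OOM surfaces as a RuntimeError
        print(f"bs={batch_size} failed: {e}")
        break
    finally:
        torch.cuda.empty_cache()     # release cached blocks between attempts
```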