Debugging CUDA out of memory (16GB GPU, 14+ GB reserved)

I’m pretty new and want to learn how to debug GPU memory allocation.

My setup:

  • Paperspace machine with an A4000 16 GB GPU
  • single notebook running
  • playing with DINOv2, just using the embedding part with pre-trained weights
  • inspecting the model, it has ~427M params, so even with float32 that should be around 1.7GB
  • loading 280x280 images that I want to get embeddings for; 100 images x 280x280x3 in float32 should be under 100 MB (quick check below)
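
A quick back-of-the-envelope check of those estimates (my own arithmetic, using the sizes above):

params = 427e6                  # reported parameter count
print(params * 4 / 1e9)         # ~1.7 GB of float32 weights
pixels = 100 * 3 * 280 * 280    # 100 RGB images at 280x280
print(pixels * 4 / 1e6)         # ~94 MB of float32 image data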

I’m still getting:

RuntimeError: CUDA out of memory. Tried to allocate 48.00 MiB (GPU 0; 15.73 GiB total capacity; 13.80 GiB already allocated; 23.12 MiB free; 14.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Here is code to reproduce:

import torch
from torchvision import transforms
from PIL import Image

m = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
m.eval()
m.cuda()

pil2tensor = transforms.ToTensor()
files = [] # 100 file paths to png files
x = torch.stack([pil2tensor(Image.open(f).resize((14*20,14*20))) for f in files])
assert x.shape == (100, 3, 280, 280)  # (N, C, H, W)

bs = 10 # tried many different batch sizes
x_embs = []
for i in range(0, len(files), bs):
    batch = x[i:i+bs]
    batch = batch.cuda()
    x_emb = m(batch)
    x_embs.append(x_emb.cpu())

Usually it fails on the x_emb = m(batch) line, i.e. while running the model inference. I tried different batch sizes. I also tried calling torch.cuda.empty_cache() everywhere, but nothing helps.

Any advice on how to figure out why so much memory is “reserved”?

It works fine on CPU.

Hello,

The peak memory usage of a network usually exceeds the size of the model plus the input data, because intermediate results of the forward pass need memory too. For instance, in a network consisting of a linear layer followed by a ReLU, the output of the linear layer adds to memory consumption because it is kept in memory before being fed to the activation function. Another example is skip connections: the model essentially keeps a copy of an earlier activation around to use later, thereby occupying more memory. Techniques such as operator fusion alleviate this but do not eliminate it completely.
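
To make this concrete, here is a small toy example of my own (assuming a CUDA device is available) that compares how much memory stays allocated after a forward pass with and without autograd:

import torch

# Toy network: linear -> ReLU -> linear, on the GPU.
net = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()
x = torch.randn(2048, 4096, device="cuda")

# With autograd enabled, the graph attached to y keeps intermediate
# activations alive (e.g. the input to the second linear layer).
y = net(x)
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated (grad enabled)")
del y

# Under no_grad no graph is recorded, so only the weights, the input
# and the output remain allocated.
with torch.no_grad():
    y = net(x)
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated (no_grad)")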

This becomes a real memory burden during backpropagation, because the forward pass must retain many intermediate results that the backward pass needs. Your application is inference and does not involve backpropagation, so storing these values is redundant; by default, however, PyTorch still records the autograd graph, and because each x_emb (and the .cpu() copy you append) holds a reference to that graph, the GPU activations of every batch you have already processed stay alive. That is why the reserved memory keeps growing until the allocator gives up. To turn this behaviour off, wrap the inference part of your code in a with torch.no_grad(): block, as demonstrated below.

import torch
from torchvision import transforms
from PIL import Image

m = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
m.eval()
m.cuda()

pil2tensor = transforms.ToTensor()
files = [] # 100 file paths to png files
x = torch.stack([pil2tensor(Image.open(f).resize((14*20,14*20))) for f in files])
assert x.shape == (100, 3, 280, 280)  # (N, C, H, W)

bs = 10 # tried many different batch sizes
x_embs = []
with torch.no_grad():  # do not record the autograd graph during inference
    for i in range(0, len(files), bs):
        batch = x[i:i+bs]
        batch = batch.cuda()
        x_emb = m(batch)
        x_embs.append(x_emb.cpu())

Should this not resolve your issue, could you run the model on the CPU and report the CPU’s memory usage?
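
To address the broader question of figuring out where the “reserved” memory goes: PyTorch exposes its caching-allocator statistics, and printing them between batches usually shows whether memory is accumulating. A minimal sketch using the standard torch.cuda calls (the tags are just examples):

import torch

def report(tag):
    # Memory occupied by live tensors vs. memory the caching allocator
    # has reserved from the driver (the number the OOM error reports).
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"{tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB, peak={peak:.0f} MiB")

report("after moving model to GPU")
# ... run one batch here ...
report("after first batch")

print(torch.cuda.memory_summary())  # detailed per-pool breakdown

If the allocated figure grows after every batch, something (such as the autograd graph mentioned above) is holding on to tensors.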

On a separate note, you are transferring each batch to the GPU individually; it may be faster to move everything in one go, i.e. call x = x.cuda() once before the loop, since the whole tensor is under 100 MB anyway.
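
For instance, something along these lines (same variables as in your snippet):

x = x.cuda()                      # single host-to-device copy up front
x_embs = []
with torch.no_grad():
    for i in range(0, x.shape[0], bs):
        batch = x[i:i+bs]         # already on the GPU; slicing is just a view
        x_embs.append(m(batch).cpu())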