Learn.get_preds() RuntimeError: CUDA out of memory


I am working on a segmentation problem, using a simple unet model based on resnet 34. I am using fastai version 1.0.55

I am training on images of size 512*512, and training runs fine with a batch size of 32 on 2 GPUs. But when I calculate predictions using learn.get_preds, it fails with the error below:
“RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB”.

I monitored the GPU memory with nvidia-smi and saw that usage keeps increasing, which looks like some kind of leak. Reducing the batch size did not help. learn.get_preds does work for images of size 256*256. So I tried writing a simple function that generates the predictions. Below is the code:

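    import torch
    from fastprogress.fastprogress import progress_bar

    def get_preds(learn, ds_type):
        "Collect raw model outputs and targets batch by batch, on the CPU."
        preds, ys = [], []
        model = learn.model
        dl = learn.dl(ds_type)
        with torch.no_grad():
            for xb, yb in progress_bar(dl):
                pred = model(xb)
                # Move each batch off the GPU before storing it
                preds.append(pred.cpu())
                ys.append(yb.cpu())
                del pred
        return torch.cat(preds), torch.cat(ys)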

Unfortunately something is going wrong in the above code. The results I get from fastai’s learn.get_preds and from my own get_preds function do not match. Any pointers on how to solve the GPU memory issue, or on what I am missing in my get_preds function, would be highly appreciated.

Fastai was probably adding the activation function on top of the predictions (a softmax over the channel dimension).
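
To make that concrete, here is a minimal sketch of turning raw segmentation outputs into per-pixel probabilities. The tensor shape and the name `raw` are made up for illustration; it assumes the model returns logits of shape (batch, classes, height, width).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical raw model output: 2 images, 3 classes, 4x4 pixels
raw = torch.randn(2, 3, 4, 4)

# Softmax over dim=1 (the channel/class dimension) gives per-pixel class probabilities
probs = F.softmax(raw, dim=1)

# Each pixel's class probabilities sum to 1, and the predicted class is unchanged
pixel_sums = probs.sum(dim=1)
same_classes = torch.equal(probs.argmax(dim=1), raw.argmax(dim=1))
```

So the predicted class per pixel is the same either way; only the raw values differ, which would explain a mismatch when comparing outputs directly.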

Thank you so much, you saved my day. Applying softmax gave me the same results. I have a few more questions, to understand why:

  1. Does fastai apply softmax to the model’s output before passing it to the loss function during training?
  2. Is there a function I can look at to understand where it is applied?
  3. Why does learn.get_preds run out of GPU memory, whereas my get_preds function, which is very similar to fastai’s, does not?

Thanks for the quick response again.

No, the softmax is applied inside the loss function in PyTorch, which is why we don’t have it in the model. In get_preds it is applied through an internal dictionary that maps the loss function to the final activation.
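
A small sketch of both points, under stated assumptions: the shapes and the `final_activation` table are hypothetical and only illustrate the idea, they are not fastai’s actual internals. It checks that `F.cross_entropy` on raw logits matches `F.nll_loss` on `log_softmax` output, then looks up an activation by loss function.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3, 8, 8)          # hypothetical (batch, classes, H, W) output
target = torch.randint(0, 3, (4, 8, 8))   # hypothetical per-pixel class labels

# F.cross_entropy works on raw logits: it applies log_softmax internally
loss_from_logits = F.cross_entropy(logits, target)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

def softmax_over_channels(t):
    "Softmax over the class dimension of a (batch, classes, H, W) tensor."
    return F.softmax(t, dim=1)

# Hypothetical lookup (not fastai's actual table): loss function -> inference activation
final_activation = {
    F.cross_entropy: softmax_over_channels,
    F.binary_cross_entropy_with_logits: torch.sigmoid,
}

activ = final_activation[F.cross_entropy]
probs = activ(logits)
```

With a table like this, the model can stay activation-free for training while predictions still come out as probabilities.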

I don’t know why learn.get_preds is throwing a memory error; it’s not supposed to.


Hmm, I am getting the out of memory error in get_preds as well, on a segmentation dataset (in this case, I am using a custom model with just Learner(data, model…)).

In the stack trace, the error seems to be thrown on res = [torch.cat(o).cpu() for o in zip(*validate(model, dl, cb_handler=cb_handler, pbar=pbar, average=False, n_batch=n_batch))] from

    def get_preds(model:nn.Module, dl:DataLoader, pbar:Optional[PBar]=None, cb_handler:Optional[CallbackHandler]=None,
                  activ:nn.Module=None, loss_func:OptLossFunc=None, n_batch:Optional[int]=None) -> List[Tensor]:
        "Tuple of predictions and targets, and optional losses (if `loss_func`) using `dl`, max batches `n_batch`."
        res = [torch.cat(o).cpu() for o in
               zip(*validate(model, dl, cb_handler=cb_handler, pbar=pbar, average=False, n_batch=n_batch))]
        if loss_func is not None:
            with NoneReduceOnCPU(loss_func) as lf: res.append(lf(res[0], res[1]))
        if activ is not None: res[0] = activ(res[0])
        return res

I am wondering: should it be torch.cat(o.cpu()) rather than torch.cat(o).cpu(), so that the tensors don’t get accumulated on the GPU?

That sounds like a better idea yes. Would you mind suggesting a PR with that?

I would… except I just tried it and realised that o is actually a list, so I can’t do o.cpu(). So I thought it might be something I can change in validate (since o is an item in zip(*validate(model,dl,cb_handler=cb_handler,pbar=pbar,average=False,n_batch=n_batch))). Except now I am even more confused: validate is supposed to return the losses, so I am not sure how get_preds goes from losses to predictions?
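
Since o is a list of per-batch tensors, one way to express the idea is to move each element to the CPU before concatenating. This is only a sketch: `batches` is a made-up stand-in for one element of the zipped validate output, and the device falls back to the CPU when no GPU is available.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for one element of zip(*validate(...)): a list of per-batch tensors
batches = [torch.full((2, 3), float(i), device=device) for i in range(4)]

# Concatenate on the device first, then copy: needs room for the full result on the GPU
cat_then_cpu = torch.cat(batches).cpu()

# Copy each batch to the CPU first, then concatenate there:
# the concatenated result never has to fit on the GPU
cpu_then_cat = torch.cat([b.cpu() for b in batches])
```

This only avoids the extra concatenated copy. The per-batch tensors themselves stay on the GPU until they are released, so moving each batch to the CPU as soon as it is produced saves more.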

The fix here is probably in validate or the method it calls: loss_batch.

Currently we get val_loss (which actually contains the outputs and targets, not a loss):

    val_loss = loss_batch(model, xb, yb, loss_func, cb_handler=cb_handler)

One option would be to move them to CPU here:

    v1, v2 = loss_batch(model, xb, yb, loss_func, cb_handler=cb_handler)
    val_loss = (v1, v2.cpu())

The other option would be to move them to CPU in loss_batch().

So instead of

    if not loss_func: return to_detach(out), yb[0].detach()

We could use:

    if not loss_func: return to_detach(out), yb[0].detach().cpu()

I’d personally go with the second option. It looks like we’ve already repurposed loss_batch beyond its original purpose (returning a loss), so it feels safe to move them to the CPU at this point as well.

Edit: After looking a little closer, it’s the yb Tensor in particular that’s causing the problem. The actual outputs are already being detached with to_detach().

In master, yb[0] goes through to_detach as well, which puts it on the CPU, so this should solve the error. Not sure when this change was made so it’s possible it’s not in the latest release.
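
For readers following along, here is a rough sketch of what a detach-and-move-to-CPU helper can look like. `detach_to_cpu` is a hypothetical name written for this illustration; it is not fastai’s to_detach and may differ from it in details.

```python
import torch

def detach_to_cpu(t):
    "Hypothetical helper: detach a tensor (or nested list/tuple of tensors) and move it to the CPU."
    if isinstance(t, (list, tuple)):
        # Recurse into containers, keeping the container type
        return type(t)(detach_to_cpu(o) for o in t)
    return t.detach().cpu()

x = torch.ones(2, 2, requires_grad=True)
out, target = detach_to_cpu((x * 3, [torch.zeros(2)]))
```

Applied to both the outputs and the targets of each batch, a helper like this keeps only CPU copies around between batches.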

Whoops, you’re right! You fixed this here: https://github.com/fastai/fastai/issues/2337

@JoshVarty @sgugger I had the same issue, then read this and updated the library. Looks like you fixed it! Thank you very much.