Runtime: CUDA out of memory error during inference

I am trying to run learn.predict() on the test set for the cdiscount Kaggle competition, but I am getting the following error during prediction:

log_preds, y = learn.predict(is_test=True)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-8-77a687d8e73e> in <module>()
----> 1 log_preds,y = learn.predict(is_test=True)

~/fast_ai_fellowship/fastai/courses/dl1/fastai/learner.py in predict(self, is_test)
231         self.load('tmp')
232 
--> 233     def predict(self, is_test=False): return self.predict_with_targs(is_test)[0]
234 
235     def predict_with_targs(self, is_test=False):

~/fast_ai_fellowship/fastai/courses/dl1/fastai/learner.py in predict_with_targs(self, is_test)
235     def predict_with_targs(self, is_test=False):
236         dl = self.data.test_dl if is_test else self.data.val_dl
--> 237         return predict_with_targs(self.model, dl)
238 
239     def predict_dl(self, dl): return predict_with_targs(self.model, dl)[0]

~/fast_ai_fellowship/fastai/courses/dl1/fastai/model.py in predict_with_targs(m, dl)
116     if hasattr(m, 'reset'): m.reset()
117     res = []
--> 118     for *x,y in iter(dl): res.append([get_prediction(m(*VV(x))),y])
119     preda,targa = zip(*res)
120     return to_np(torch.cat(preda)), to_np(torch.cat(targa))

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
222         for hook in self._forward_pre_hooks.values():
223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
225         for hook in self._forward_hooks.values():
226             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/container.py in forward(self, input)
 65     def forward(self, input):
 66         for module in self._modules.values():
---> 67             input = module(input)
 68         return input
 69 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
222         for hook in self._forward_pre_hooks.values():
223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
225         for hook in self._forward_hooks.values():
226             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/activation.py in forward(self, input)
718 
719     def forward(self, input):
--> 720         return F.log_softmax(input)
721 
722     def __repr__(self):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in log_softmax(input)
535 
536 def log_softmax(input):
--> 537     return _functions.thnn.LogSoftmax.apply(input)
538 
539 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/_functions/thnn/auto.py in forward(ctx, input, *params)
172             del ctx.buffers
173 
--> 174         getattr(ctx._backend, update_output.name)(ctx._backend.library_state, input, output, *args)
175         return output
176 

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCStorage.cu:66

I have the latest fastai code and an updated fastai environment. From the error, I think it is running out of memory while iterating over the test set (line number 118). But assuming the dataloader is only loading minibatches, I don't know why that should cause memory issues. Has anyone else faced this problem?
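
Looking at that loop, the only thing I can think of is that res keeps every batch's predictions on the GPU until the final torch.cat, so memory would grow with the size of the test set rather than with the batch size. A rough sketch of the variant I'm tempted to try (untested; it reuses the VV, get_prediction and to_np helpers from model.py shown in the traceback, and just moves each batch to the CPU as it is produced):

import torch
# assumes the usual course imports are already in scope, e.g.
# from fastai.conv_learner import *   (for VV, get_prediction, to_np)

def predict_with_targs_cpu(m, dl):
    # same loop as model.py line 118, but each batch of predictions is
    # moved to host memory right away, so GPU usage stays at roughly one
    # batch's worth instead of growing with the number of test batches
    if hasattr(m, 'reset'): m.reset()
    res = []
    for *x, y in iter(dl):
        preds = get_prediction(m(*VV(x)))
        res.append([preds.cpu(), y.cpu()])
    preda, targa = zip(*res)
    return to_np(torch.cat(preda)), to_np(torch.cat(targa))

Then something like predict_with_targs_cpu(learn.model, learn.data.test_dl)[0] should match what learn.predict(is_test=True) returns, if I'm reading predict_with_targs correctly.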


One time I faced this issue was when there were some other Jupyter notebooks open in the background. Once I shut down those notebooks and refreshed, everything worked well. If that doesn't work, your GPU may not have enough RAM and you might have to lower your batch size.


I don’t have any other notebooks running. But how do you pass the batch size for the test set? Is the only way to set it during the definition of the data variable? The default batch size of 64 worked fine during training, so I am wondering why it would be a problem during inference. I am running on an NVIDIA Titan X with 12 GB of RAM, by the way.

You may have stray Jupyter sessions; check with:
ps aux | grep jupyter

Also correlate that with the processes using your GPU memory.
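
A plain nvidia-smi will show which PIDs are holding memory on each GPU; if I remember the flags right, this prints just that table:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv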


Thanks. But I made sure by restarting my kernel and rerunning only that specific notebook. Still the same error. I even defined my data variable with a bs of 4, but the GPU still runs out of memory. Not sure why.
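
To be explicit, by "defined my data variable with a bs of 4" I mean the lesson-1-style pattern, roughly like this (a sketch only; the architecture, size and path below are placeholders, and my actual constructor call differs a bit):

from fastai.conv_learner import *

arch = resnet34            # placeholder architecture
sz = 224                   # placeholder image size
PATH = 'data/cdiscount/'   # placeholder path
tfms = tfms_from_model(arch, sz)
data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=4, test_name='test')
learn = ConvLearner.pretrained(arch, data)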

What is your GPU memory usage before running your notebook?

Only around 300 MB from nvidia-smi.

That is strange @nafizh. I'm running out of ideas, so aiming in the dark here:

  • How big is your model? I am not sure there is a direct correlation between the size of the model saved on disk and its memory use, but how big is your saved model state file on disk? (A quick way to check is sketched below.)
  • For your test data, can you remove all but one instance from your test set and then run the prediction?
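
To make the first check concrete, here's a rough sketch (the checkpoint path is just a placeholder for wherever learn.save() put your weights):

import os

def report_model_size(learn, checkpoint='models/tmp.h5'):  # checkpoint path is a placeholder
    # parameter count actually held in memory vs. the checkpoint on disk
    n = sum(p.data.nelement() for p in learn.model.parameters())
    print('parameters: %d (~%.0f MB as float32)' % (n, n * 4 / 1e6))
    if os.path.exists(checkpoint):
        print('checkpoint on disk: ~%.0f MB' % (os.path.getsize(checkpoint) / 1e6))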

I also ran into this problem and solved it by reducing the image size from 400x400 to 300x300 (in the Plant Seedlings competition).

Also, initial memory consumption was around 500 MB in my case, and then it jumped all the way to ~11 GB.
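
If you want to watch it climb in real time, something like this refreshes the usage every second:

watch -n 1 nvidia-smi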

This is really weird. Now, with the fastai environment, training is not running at all in any notebook; the kernel just dies. If I do it in my normal environment, it starts training, but after some time the kernel dies again. I wonder if @jeremy has any experience with such issues?

Rebooting the AWS instance as a last resort has always worked for me so far.

I’ve been seeing a similar error in two cases:

  1. When I used resnet152 instead of resnet34. Switching back to resnet34 solved the memory error.
  2. When I ran 3 copies of the same notebook with different parameters. This started working after a reboot.

While running cells that take a while, I use the Jupyter %time magic to see how long they run. Usually, before I get an out-of-memory error, I notice that each run takes longer than usual.
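
e.g. something like this at the top of the prediction cell:

%time log_preds = learn.predict(is_test=True)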

It seems this problem was only happening in the fastai environment; even restarts did not help. Once I switched from the fastai environment to my machine's default environment, training started, but the kernel would still die after around 1% of training.