A guide to recovering from CUDA Out of Memory and other exceptions

This thread explains, and helps sort out, situations where an exception happens in a Jupyter notebook and the user can’t do anything else without restarting the kernel and re-running the notebook from scratch. This usually happens with a CUDA Out of Memory exception, but it can happen with any exception.

Please read the guide https://docs.fast.ai/troubleshoot.html#memory-leakage-on-exception and, if you have any questions or difficulties applying the information, please ask in this dedicated thread.

If you want to skip reading the guide, fastai-1.0.42 or higher has a built-in workaround just for CUDA Out of Memory, so if you update your fastai install, chances are you’re already taken care of.
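
The mechanism behind this kind of leak can be reproduced without a GPU. The sketch below is plain Python, not fastai's actual workaround: `Big` is a hypothetical stand-in for a large tensor, and `saved` imitates the notebook holding on to the last exception. While the exception is kept, its traceback keeps the failed function's local variables alive; clearing the traceback frames lets them be freed.

```python
import gc
import traceback
import weakref


class Big:
    """Hypothetical stand-in for a large object such as a GPU tensor."""


def fail():
    big = Big()                      # local we would like freed after the error
    fail.ref = weakref.ref(big)      # lets us check later whether it is still alive
    raise RuntimeError("simulated CUDA out of memory")


saved = None
try:
    fail()
except RuntimeError as e:
    saved = e                        # imitates the notebook keeping the last exception

gc.collect()
# exception -> traceback -> frame of fail() -> its locals -> big
alive_while_saved = fail.ref() is not None

traceback.clear_frames(saved.__traceback__)   # drop the locals held by those frames
gc.collect()
alive_after_clear = fail.ref() is not None

print(alive_while_saved, alive_after_clear)
```

Running a cell that raises a different, cheap exception has a similar effect in a notebook, because the stored traceback is then replaced by a small one.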

Hello @stas

Any advice if you still receive the CUDA Out of Memory error?

I tried creating a cell in my notebook and running 1/0. I am assuming it’s from another running kernel. Is htop in a terminal the best way to find processes that are taking up resources, followed by a manual kill command?

Use torch.cuda.empty_cache()
It releases GPU memory that PyTorch has cached but is no longer using. Memory held by live tensors is not freed.
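
A minimal sketch of how this call is typically combined with garbage collection. The helper name `release_cached_gpu_memory` is made up for illustration, and the `ImportError` guard is there only so the snippet also runs on a machine without PyTorch.

```python
import gc

try:
    import torch
except ImportError:        # lets the sketch run where PyTorch is not installed
    torch = None


def release_cached_gpu_memory():
    """Collect garbage, then return PyTorch's unused cached GPU blocks to the driver.

    Returns (allocated_bytes, reserved_bytes) measured after the cleanup,
    or None when PyTorch or a CUDA device is not available.
    """
    gc.collect()                       # drop unreachable tensors first
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()           # frees cached blocks, not live tensors
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


stats = release_cached_gpu_memory()
print(stats)
```

If the allocated figure stays high after this, something in the notebook still references the tensors (a variable, a learner, or a stored traceback), and no cache call will release them.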

Thanks @srjamali

I tried that, but psutil is still showing 64% of memory in use. Am I missing something?

Update: I just stopped the Jupyter notebook and started it again which cleared out the memory usage!

To see which processes are using the GPU, run this in a Linux terminal:
nvidia-smi

It shows GPU memory usage and the IDs of the processes using it.
To find out where a given process was started from, print its working directory:
sudo pwdx process_id
If the process is not needed, kill it:
sudo kill -SIGKILL process_id
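
The same lookup can be scripted. This is a rough sketch that calls nvidia-smi with its CSV query options and parses the result; the function names `parse_compute_apps` and `gpu_processes` are invented here, and the code returns an empty list when nvidia-smi is not on the PATH.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into (pid, name, MiB) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            pid, rest = line.split(",", 1)
            name, mem = rest.rsplit(",", 1)   # the name itself may contain commas
            pid = int(pid)
        except ValueError:
            continue                          # skip anything that is not a data row
        mem = mem.strip()
        rows.append((pid, name.strip(), int(mem) if mem.isdigit() else None))
    return rows


def gpu_processes():
    """Return the compute processes nvidia-smi reports, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_compute_apps(result.stdout)


for pid, name, mem_mib in gpu_processes():
    print(pid, name, mem_mib)
```

The parsing is kept separate from the subprocess call so it can be checked against a pasted sample of nvidia-smi output on a machine without a GPU.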

Thanks!

Has anybody solved this without restarting the Jupyter notebook?

Not sure if you still need this, but you can try the following, from here: DL on a shoestring

Most of the time, the following code will also free the memory, but I am not sure this is what you want, as it deletes the learner object. Be sure to watch your performance monitor (on Windows, open Task Manager > Performance > GPU and see if usage drops after deletion).

import gc

# pass with_opt=False if you do not need to save the optimizer state as well
learner.save("temp_model", with_opt=False)
# destroy() frees the learner's memory; the object is unusable afterwards
learner.destroy()
gc.collect()

Then, just re-create your learner object and try loading the saved model. Sometimes it works and sometimes it doesn’t. I have been running into a lot of memory issues recently, and this has solved some of them.

I’ve discovered some errors that were resolved by removing the .fastai directory. I ran into them while running course_v4 01_intro.ipynb in Docker, but they aren’t limited to that notebook or to Docker. I filed an issue on the fastai2 GitHub repo.

Removing the .fastai directory resolved the following errors:

  1. “CUDA out of memory error”
  2. “list index out of range” when data loading, probably due to a defective cache.

Both issues were resolved by executing:
rm -rf $HOME/.fastai
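
A more cautious variant, sketched here as an assumption rather than something tested in the original report, is to rename the directory instead of deleting it, so the old contents can be restored or inspected. The helper name `set_aside` and the backup naming scheme are made up for illustration.

```python
import time
from pathlib import Path


def set_aside(directory):
    """Rename `directory` to a timestamped sibling instead of deleting it.

    Returns the new path, or None if the directory does not exist.
    """
    directory = Path(directory).expanduser()
    if not directory.is_dir():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = directory.with_name(f"{directory.name}.bak-{stamp}")
    directory.rename(backup)
    return backup


# Usage (commented out so that nothing is moved by accident):
# set_aside("~/.fastai")
```

Once the errors are confirmed gone, the backup directory can be deleted by hand.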

thanks this really helped me!!