A guide to recovering from CUDA Out of Memory and other exceptions

stas · January 26, 2019, 3:57am

This thread explains and helps sort out situations where an exception occurs in a Jupyter notebook and the user can’t do anything else without restarting the kernel and re-running the notebook from scratch. This usually happens after a CUDA Out of Memory exception, but it can happen with any exception.

Please read the guide https://docs.fast.ai/troubleshoot.html#memory-leakage-on-exception and if you have any questions or difficulties with applying the information please ask the questions in this dedicated thread.

If you want to skip reading the guide, fastai-1.0.42 or higher has a built-in workaround just for the CUDA Out of Memory, so if you update your fastai install, chances are you’re already taken care of.
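
To illustrate the kind of leak the guide is about, here is a minimal sketch in plain Python. It assumes the usual mechanism: a stored exception keeps its traceback, the traceback keeps the failed function's frame, and the frame keeps its local variables alive. BigBuffer and failing_step are hypothetical stand-ins for a large tensor and a training step, and this is not fastai's actual workaround.

```python
import gc
import traceback
import weakref


class BigBuffer:
    """Hypothetical stand-in for a large object such as a GPU tensor."""

    def __init__(self):
        self.data = bytearray(1_000_000)


def failing_step():
    buf = BigBuffer()
    # A weak reference lets us check later whether buf is still alive.
    failing_step.probe = weakref.ref(buf)
    raise RuntimeError("simulated out-of-memory error")


saved_exc = None
try:
    failing_step()
except RuntimeError as e:
    # Keeping the exception also keeps its traceback, the frame of
    # failing_step, and therefore the local variable buf.
    saved_exc = e

alive_before = failing_step.probe() is not None

# Clearing the frames in the traceback drops their local variables.
traceback.clear_frames(saved_exc.__traceback__)
gc.collect()
alive_after = failing_step.probe() is not None

print("alive while traceback is kept:", alive_before)
print("alive after clearing frames:", alive_after)
```

In a notebook the last traceback is stored by the interactive shell rather than by your own variable, which is presumably why running a cheap failing cell such as 1/0 can help: it replaces the stored traceback with a tiny one.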

aksg87 · August 16, 2019, 4:16pm

Hello @stas

Any advice if you still receive the CUDA Out of Memory error?

I tried creating a cell in my notebook and running 1/0. I am assuming the memory is held by another running kernel. Is htop in a terminal the best way to find processes that are taking up resources, followed by a manual kill command?

srjamali · August 16, 2019, 4:48pm

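Use torch.cuda.empty_cache().
It releases the cached GPU memory that PyTorch is holding but not using, so it will free some memory for you. Memory occupied by tensors that are still referenced stays allocated.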
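
A minimal sketch of how that call is often combined with garbage collection, assuming PyTorch is installed and a CUDA device is present. The function name release_cached_gpu_memory is made up for this example, and the guards simply make the snippet return None on machines without PyTorch or a GPU.

```python
import gc

try:
    import torch
except ImportError:  # keep the sketch runnable on machines without PyTorch
    torch = None


def release_cached_gpu_memory():
    """Collect unreachable Python objects, then release PyTorch's unused cached blocks.

    Returns (allocated_bytes, reserved_bytes) measured afterwards,
    or None when PyTorch or a CUDA device is not available.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    gc.collect()              # drop unreachable objects that may still hold tensors
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


print(release_cached_gpu_memory())
```

If the allocated figure stays high after this, something still references the tensors (a variable, a learner object, or a stored traceback), and emptying the cache alone will not help.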

aksg87 · August 16, 2019, 7:27pm

Thanks @srjamali

I tried that, but psutil is still showing 64% of memory in use. Am I missing something?

aksg87 · August 16, 2019, 7:36pm

Update: I just stopped the Jupyter notebook and started it again which cleared out the memory usage!

srjamali · August 17, 2019, 7:26am

Whenever you want to see which process is using the GPU, type this in a terminal on Linux:
nvidia-smi

This command shows GPU memory usage and the IDs of the processes that are using it.
After that, use
sudo pwdx process_id
to see the working directory of a process, which helps identify what it is.
If the process is unnecessary, use
sudo kill -SIGKILL process_id
to kill it and free the GPU memory it holds.
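
The same lookup can be scripted. Below is a rough Python sketch that asks nvidia-smi for the compute processes in CSV form and parses the result. The helper names parse_compute_apps and list_gpu_processes are invented for this example, and the sketch returns an empty list if nvidia-smi is missing or fails.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' CSV rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        # The pid is the first field and used memory the last, so split from both ends.
        pid, first_sep, rest = line.partition(",")
        name, last_sep, used = rest.rpartition(",")
        if not first_sep or not last_sep or not pid.strip().isdigit():
            continue  # skip blank or unexpected lines
        rows.append({"pid": int(pid), "name": name.strip(), "used_mib": used.strip()})
    return rows


def list_gpu_processes():
    """Ask nvidia-smi for compute processes; return [] if it is unavailable."""
    cmd = [
        "nvidia-smi",
        "--query-compute-apps=pid,process_name,used_memory",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_compute_apps(out)


for proc in list_gpu_processes():
    print(proc)
```

The script only lists processes. Killing one is left as a manual step on purpose, so you can check what each PID is first.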

aksg87 · August 17, 2019, 11:20pm

Thanks!

enr · November 7, 2019, 4:33pm

Has anybody solved this without restarting the Jupyter notebook?

learn2wong · December 30, 2019, 7:51pm

Not sure if you still need this, but you can try the following from here: DL on a shoestring

Most of the time, the following code will also free the memory, but I am not sure this is what you want, as it deletes the learner object. Be sure to watch your performance monitor (if you’re on Windows, just open Task Manager > Performance > GPU and see if usage drops after deletion).

import gc

# pass with_opt=False if you are not interested in saving the optimizer state as well
learner.save("temp_model", with_opt=False)
learner.destroy()  # frees the learner's internals; the object is no longer usable
gc.collect()

Then just re-create your learner object and try loading the saved model. Sometimes it works and sometimes it doesn’t. I have been running into a lot of memory issues recently, and this has solved some of them.

bsalita · April 15, 2020, 2:50pm

I’ve discovered some errors which were resolved by removing the .fastai directory. The issues were discovered while running course_v4 01_intro.ipynb in Docker but aren’t limited to either. I created an issue on the fastai2 GitHub.

Removing the .fastai directory resolved the following errors:

“CUDA out of memory error”
“list index out of range” when data loading, probably due to a defective cache.

Both issues were resolved by executing:
rm -rf $HOME/.fastai
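
If deleting the directory outright feels risky, a more cautious variant is to rename it so it can be restored later. This is only a sketch: set_aside_dir is a hypothetical helper, and it assumes that a missing .fastai directory is simply re-created on the next run.

```python
import shutil
import time
from pathlib import Path


def set_aside_dir(path):
    """Rename `path` to a timestamped backup.

    Returns the backup path, or None if `path` does not exist.
    """
    path = Path(path).expanduser()
    if not path.exists():
        return None
    backup = path.with_name(f"{path.name}.bak-{int(time.time())}")
    shutil.move(str(path), str(backup))
    return backup


# Example (commented out so nothing is moved by accident):
# set_aside_dir("~/.fastai")
```

Once things work again, the backup directory can be removed by hand.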

lenux · April 16, 2021, 2:17pm

thanks this really helped me!!