This thread explains and helps sort out situations where an exception in a Jupyter notebook leaves the user unable to do anything without restarting the kernel and re-running the notebook from scratch. This usually happens with a CUDA Out of Memory exception, but it can occur with any exception.
If you want to skip reading the guide, fastai-1.0.42 or higher has a built-in workaround just for CUDA Out of Memory errors, so if you update your fastai install, chances are you’re already taken care of.
Any advice if you still receive the CUDA Out of Memory error?
I tried creating a cell in my notebook and running 1/0. I am assuming the error comes from another running kernel. Is htop in the terminal the best way to find the processes that are taking up resources, so I can kill them manually?
Whenever you want to see which process is using the GPU, type the following in a Linux terminal:

nvidia-smi

This command will show you GPU memory usage and the process IDs using it.
After that, use

sudo pwdx process_id

to get details of a process. If the process is unnecessary, use

sudo kill -SIGKILL process_id

to kill it and free your GPU.
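The same lookup can be done from Python by parsing nvidia-smi’s CSV query output. The `--query-compute-apps` and `--format=csv,noheader` flags are real nvidia-smi options; the `gpu_processes` helper and the sample line are hypothetical, included so the sketch runs even without a GPU:

```python
import subprocess

def gpu_processes(sample=None):
    """Return (pid, used_memory) pairs of processes using the GPU.

    If `sample` text is given, parse that instead of calling nvidia-smi,
    so this sketch can run on a machine with no GPU.
    """
    if sample is None:
        sample = subprocess.check_output(
            ["nvidia-smi", "--query-compute-apps=pid,used_memory",
             "--format=csv,noheader"], text=True)
    procs = []
    for line in sample.strip().splitlines():
        pid, mem = (field.strip() for field in line.split(","))
        procs.append((int(pid), mem))
    return procs

# Hypothetical sample line in nvidia-smi's CSV format:
print(gpu_processes("1234, 7890 MiB\n"))  # → [(1234, '7890 MiB')]
```

With the PIDs in hand you can inspect each one with pwdx and kill the stale ones as shown above.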
Most of the time, the following code will also free the memory, but I am not sure this is what you want, as it deletes the learner object. Be sure to watch your performance monitor (if you’re on Windows, open Task Manager > Performance > GPU and check whether usage drops after deletion).
# use the with_opt flag if you are not interested in saving optim state as well
Then, just re-create your learner object and try loading it back in. Sometimes it works and sometimes it doesn’t. I have been running into a lot of memory issues recently and this solved them some of the time.
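The delete-and-collect step above can be sketched without a GPU. `FakeLearner` is a hypothetical stand-in for a fastai `Learner`; on a real setup you would typically call `learn.save(..., with_opt=False)` first if you don’t need optimizer state, and follow the `gc.collect()` with `torch.cuda.empty_cache()` so PyTorch returns its cached blocks to the driver:

```python
import gc
import weakref

class FakeLearner:
    """Hypothetical stand-in for a fastai Learner holding a large buffer."""
    def __init__(self):
        self.weights = bytearray(10**7)  # ~10 MB, stands in for GPU tensors

learn = FakeLearner()
ref = weakref.ref(learn)

# Dropping the last reference lets Python reclaim the memory;
# on a real GPU you would then call torch.cuda.empty_cache().
del learn
gc.collect()
print(ref() is None)  # → True: the learner is gone
```

The key point is that `empty_cache()` alone does nothing while the learner (and the tensors it holds) is still referenced; the `del` has to come first.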
I’ve discovered some errors which were resolved by removing the .fastai directory. The issues came up while running course_v4 01_intro.ipynb in Docker, but they aren’t limited to that setup. I created an issue on the fastai2 GitHub repo.
Removing the .fastai directory resolved the following errors:
“CUDA out of memory error”
“list index out of range” during data loading, probably due to a defective cache.
Both issues were resolved by executing:
rm -rf $HOME/.fastai
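If you prefer to clear the cache from inside a notebook, the same cleanup can be sketched with the standard library. `clear_cache` is a hypothetical helper, and the demo deliberately operates on a throwaway directory rather than the real ~/.fastai:

```python
import shutil
import tempfile
from pathlib import Path

def clear_cache(cache_dir):
    """Delete a cache directory tree if it exists (like `rm -rf`)."""
    cache_dir = Path(cache_dir)
    if cache_dir.exists():
        shutil.rmtree(cache_dir)

# Demonstrate on a throwaway directory instead of the real ~/.fastai:
demo = Path(tempfile.mkdtemp()) / ".fastai"
(demo / "data").mkdir(parents=True)
clear_cache(demo)
print(demo.exists())  # → False
```

To clear the real cache you would pass `Path.home() / ".fastai"` instead; note that fastai will re-download any datasets it needs afterwards.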