Hi, Stas –
I think this is great, but I’m wondering if only checking the actually available GPU RAM is the best approach. I say this because, as I’ve been going through the lessons, I found that sometimes my GPU RAM got tied up even though nothing important was happening in the notebook – somehow the Python process got stuck (and not necessarily because a CUDA OOM exception had occurred beforehand).
I found that I could use `nvidia-smi` to find the process ID that was taking up GPU RAM, and then run `kill {process_id}` at the terminal to free up the resources. Of course, this would reset my kernel, but I was trying to do that anyway in the notebook and it wasn’t working.
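For what it’s worth, here’s roughly what that looks like at the terminal – the PID is just a placeholder, and the exact output format can vary with the driver version:

```bash
# List the processes currently holding GPU memory, with their PIDs and usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# Kill the stale process by its PID (replace 12345 with a PID from the output above).
# Plain kill sends SIGTERM; kill -9 may be needed if the process is truly stuck.
kill 12345
```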
Just a thought. What do you think?
EDIT: I should say that as far as the application you’re describing goes, I think your solution is necessary for those who can’t even run cells that require more GPU RAM than they have. I’m speaking more to the circumstance where one might have enough GPU RAM available if they killed processes that weren’t actually doing anything useful.