My GPU seems to be running out of memory when I try to run large images (480,640) through my model. I know this can be expected; however, my task manager suggests that I still have plenty of memory available.
Sometimes the CUDA runtime error also says I am trying to allocate, say, 2GB of memory when 4GB is available. This doesn’t make sense to me.
For reference, I am running the latest version of fastai v1 on a Windows 10 machine.
My GPU is an Nvidia GeForce GTX 1080 Ti.
As soon as you start training your model (learn.fit_one_cycle), try monitoring your GPU using nvidia-smi by executing the following command from a separate Windows console. It will show you the GPU usage stats:
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
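If you would rather log these numbers from Python than watch a console, the same kind of query can be wrapped in a few lines. This is only a rough sketch: it assumes nvidia-smi is on your PATH, the helper names (parse_memory_csv, query_gpu_memory) are made up for illustration, and it asks for csv,noheader,nounits so that each line of output is just three numbers in MiB.

```python
import subprocess

# Fields to ask nvidia-smi for; all three are reported in MiB.
QUERY = "memory.total,memory.free,memory.used"


def parse_memory_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        total, free, used = (int(field.strip()) for field in line.split(","))
        rows.append({"total": total, "free": free, "used": used})
    return rows


def query_gpu_memory():
    """Run nvidia-smi once (assumed to be on the PATH) and parse its output."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_memory_csv(out)
```

Calling query_gpu_memory() right before and right after a fit_one_cycle run gives you two snapshots to compare, which is often easier to read than a scrolling console.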
I’ve run this and I seem to have the same issue: I have memory available on the GPU, but it doesn’t seem to get used before the error arises. For reference, look at my screenshot.
It could be because PyTorch needs to allocate a single 2GB block, so even if you have more free memory in total there may not be one contiguous 2GB block (that is a big block). If this is the only GPU in your system then there may be lots of other things using little bits of memory that over time become spread out (PyTorch should generally ensure it doesn’t spread out too much).
It’s also not uncommon to see cases (even on Linux, with nothing else accessing the GPU), where it says it’s trying to allocate less than the available space but fails. Either because of the issue of needing a single block, or just because of the vagaries of CUDA, there’s lots of things going on and stats aren’t necessarily entirely accurate.
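To make the "single block" point concrete, here is a toy model of it. This is not how CUDA or PyTorch actually track memory; the free lists are invented numbers, and the only rule modelled is that a request needs one contiguous free block at least as large as itself.

```python
def largest_free_block(free_blocks_mb):
    """Size of the biggest contiguous free block, in MB (0 if nothing is free)."""
    return max(free_blocks_mb, default=0)


def can_allocate(request_mb, free_blocks_mb):
    """A request succeeds only if a single free block is big enough for it."""
    return largest_free_block(free_blocks_mb) >= request_mb


# Hypothetical free list: 4 GB free in total, but split into four 1 GB holes.
fragmented = [1024, 1024, 1024, 1024]
# Same total amount of free memory, but in one piece.
contiguous = [4096]

print(sum(fragmented), can_allocate(2048, fragmented))  # 4096 False
print(sum(contiguous), can_allocate(2048, contiguous))  # 4096 True
```

Both cases report 4 GB free, yet only the second can satisfy a 2 GB request, which matches the kind of error message described above.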
Not sure about Windows, but on Linux nvidia-smi will show the programs using GPU memory so you can close them.
If you have integrated graphics on your motherboard as well as the GPU you’re using, then you might look into switching Windows to the integrated graphics while doing DL, as this should help with interference. There are various tools to handle this sort of switching.
If you are using progressive resizing, you should call learn.purge after you have changed to a different image size, also after the
What are your batch size and image size?
480,640, with a batch size of around 20.
Why purge after unfreezing?
I run the unfreeze call fairly often as I have it in the same cell as my fit_one_cycle call.
You may check out this documentation
Below is some information about learn.purge(), extracted from this discussion
learn.purge() removes any of the Learner guts that are no longer needed and reloads the model on GPU, which also helps to reduce memory fragmentation
Call learn.purge() before any big change in your model training (image size, unfreeze, etc.).
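As a rough sketch of where those calls could go in a progressive-resizing loop: learn is assumed to be a fastai v1 Learner, make_data is a hypothetical helper of your own that builds a DataBunch for a given image size, and the function names are invented for this example. Treat it as one possible arrangement, not the recommended recipe.

```python
def train_progressively(learn, make_data, sizes, epochs=1):
    """Train at each image size in turn, purging after every size change.

    learn     -- a fastai v1 Learner (or any object with the same methods)
    make_data -- hypothetical helper: image size -> DataBunch
    sizes     -- image sizes to train at, smallest first
    """
    for size in sizes:
        learn.data = make_data(size)  # switch to the new image size
        learn.purge()                 # drop stale state before training again
        learn.fit_one_cycle(epochs)


def unfreeze_and_train(learn, epochs=1):
    """Purge before the big change (unfreezing), then train."""
    learn.purge()
    learn.unfreeze()
    learn.fit_one_cycle(epochs)
```

If unfreeze sits in the same cell as fit_one_cycle and gets re-run often, keeping the purge right next to it like this means the two always happen together.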