Basically, the training loop starts normally, but after a few epochs I get a CUDA out of memory error. I think this started happening after one of the recent fastai updates. Has anyone else experienced this issue? Could there be a memory-cleanup problem or something similar in the training loop with a U-Net?
The U-Net architecture is quite memory-hungry, so it may well be that your model or input resolution is too large and/or your GPU doesn’t have enough memory. However, if you’re following the two notebooks line by line and have over 8 GB of memory, the issue might stem from a bug somewhere.
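In case it helps rule out the hardware side, here’s a quick (illustrative) way to check how much memory your GPU has and how much PyTorch is currently using; it assumes you’re on device 0:

```python
import torch

# Report total VRAM on the device plus what PyTorch has reserved/allocated.
props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024**3
reserved_gb = torch.cuda.memory_reserved(0) / 1024**3    # cached by PyTorch's allocator
allocated_gb = torch.cuda.memory_allocated(0) / 1024**3  # actually held by live tensors
print(f"{props.name}: {total_gb:.1f} GB total, "
      f"{reserved_gb:.1f} GB reserved, {allocated_gb:.1f} GB allocated")
```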
Well, I used a very low batch size (2, I think) and my GPU has 16 GB of memory. The reason I don’t think the problem is my GPU is that training works for a few epochs and then throws a CUDA error for some reason.
If your only concern is running out of memory after a few epochs rather than at the very beginning, then that is normal. PyTorch’s CUDA memory caching isn’t flawless, so usage can creep up slightly over time, and if you’re pushing the limits of your VRAM, you can hit the ceiling after a while. For instance, suppose that in a perfect environment a forward pass with a batch size of 16 would occupy 15.8 GB of memory, and you have access to only 16 GB. In the first few iterations your usage wouldn’t exceed 15.8 GB and all would be well, but after many more batches it would slowly climb to 15.85 GB, then 15.9 GB, and finally 16 GB, at which point you get the error. You won’t go from 10 GB to 16 GB like this, but a few hundred megabytes of creep is expected. The two fixes I know of are calling Python’s garbage collector and restarting your runtime every few epochs.
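If it’s useful, here’s a minimal sketch of the garbage-collector fix in plain PyTorch (not fastai-specific); `free_gpu_memory` and the surrounding loop are just placeholder names:

```python
import gc
import torch

def free_gpu_memory():
    """Drop unreachable tensors and return cached CUDA blocks to the driver."""
    gc.collect()              # collect Python objects that still reference GPU tensors
    torch.cuda.empty_cache()  # release PyTorch's cached (but currently unused) CUDA memory

# Hypothetical usage: clear things out every few epochs during training.
# for epoch in range(n_epochs):
#     train_one_epoch(model, dls)
#     if epoch % 3 == 2:
#         free_gpu_memory()
```

This only reclaims memory that nothing references anymore, so it won’t fix a genuinely too-large model, but it does help with the slow creep described above.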
I tested the notebooks you’ve linked and encountered no errors, so the problem isn’t fastai and is likely a mistake in your code somewhere. If you are copying and pasting the code and are therefore sure there are no bugs on your side, try updating fastai and see if the issue persists.