Out of CUDA memory when trying to train the NLP notebook from the quickstart

Hi all,

I am trying to learn fastai. I’ve watched lesson 1 and gone through most of the quickstart guide. The NLP quickstart, however, never finishes training (stack trace below).

RuntimeError: CUDA out of memory. Tried to allocate 92.00 MiB (GPU 0; 5.93 GiB total capacity; 4.76 GiB already allocated; 58.00 MiB free; 4.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

According to nvidia-smi, GPU memory usage sits at about 2 GB (out of 6 GB) right up until it errors, at which point it suddenly jumps to 6 GB. I’ve tried lower batch sizes, but they just error later. Eventually, it ends up taking around an hour. From what the lesson said, this shouldn’t be the case, and it should be fairly quick. I have a Ryzen 7 3700X, 64 GB of RAM, and an Nvidia GTX 1060 6GB.
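For what it’s worth, the numbers in the error message above can be read off with a few lines of Python. This is only a sketch: the `parse_oom` helper and its regular expressions are made up for illustration and are not part of PyTorch. The point is to check the hint in the message, which says `max_split_size_mb` is worth trying when reserved memory is much larger than allocated memory.

```python
import re

# The message copied from the stack trace above.
msg = ("CUDA out of memory. Tried to allocate 92.00 MiB (GPU 0; 5.93 GiB total capacity; "
       "4.76 GiB already allocated; 58.00 MiB free; 4.87 GiB reserved in total by PyTorch)")

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

def parse_oom(message):
    """Hypothetical helper: pull the memory figures out of the message, in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    stats = {}
    for name, pattern in patterns.items():
        value, unit = re.search(pattern, message).groups()
        stats[name] = float(value) * UNIT_TO_MIB[unit]
    return stats

stats = parse_oom(msg)
# Memory PyTorch holds in its cache but is not actively using.
gap_mib = stats["reserved"] - stats["allocated"]
print(f"requested {stats['requested']:.0f} MiB, free {stats['free']:.0f} MiB, "
      f"reserved - allocated = {gap_mib:.2f} MiB")
```

Reserved memory is only about 0.11 GiB above allocated memory here, so fragmentation does not look like the main problem; the model and batch simply seem to need more memory than the card has.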

I think the only thing you can try is to bring down the batch size and see what works for you. I don’t think the 1060 supports fp16 the way a 1070 Ti might, but you can try that and see if it gets you past the hurdle. With less GPU memory, unfortunately, you sort of have to deal with such annoyances. I recall running that cell in a reasonable time on a 1070 Ti with to_fp16() and a batch size of 16 or maybe even 8. The 1070 Ti has 8 GB, by the way.
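A rough sketch of what that could look like for the IMDb classifier from the text quickstart. The halving helper, the retry loop and the starting batch size of 64 are my own assumptions, not something from the quickstart; the fastai calls are the ones I remember the quickstart using, plus `to_fp16()`. Whether fp16 actually saves memory on a 1060 is something you would have to test.

```python
def halving_batch_sizes(start=64, floor=4):
    """Yield start, start/2, start/4, ... down to floor (hypothetical helper)."""
    bs = start
    while bs >= floor:
        yield bs
        bs //= 2

def train_once(bs):
    """Train the quickstart text classifier with batch size `bs` in mixed precision."""
    # Imported here so the helpers in this sketch work without fastai installed.
    from fastai.text.all import (URLs, untar_data, TextDataLoaders,
                                 text_classifier_learner, AWD_LSTM, accuracy)
    path = untar_data(URLs.IMDB)
    dls = TextDataLoaders.from_folder(path, valid='test', bs=bs)
    learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
    learn = learn.to_fp16()   # mixed precision
    learn.fine_tune(4, 1e-2)
    return learn

def first_fitting_batch_size(train_fn, sizes):
    """Return the first batch size for which train_fn does not run out of memory."""
    for bs in sizes:
        try:
            train_fn(bs)
            return bs
        except RuntimeError as e:
            # Only swallow out-of-memory errors; anything else is a real bug.
            if "out of memory" not in str(e):
                raise
    return None

# Usage (needs fastai and a GPU, so it is left commented out):
# first_fitting_batch_size(train_once, halving_batch_sizes(64, 4))
```

In a notebook, GPU memory is not always released cleanly after an out-of-memory error, so restarting the kernel between attempts may be more reliable than retrying in a loop.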

Hi, yes, this can happen: memory use can grow as the derivatives are calculated over time.

PyTorch keeps the values it needs to compute the derivatives of every layer, and this quickly increases memory usage up to the point where the out-of-memory error happens. If you find that your GPU cannot handle the memory your model needs, try testing on a newer GPU, for example in Google Colab, and see if the error vanishes. Afterwards you can check how much GPU memory was used there (in Colab) to process everything.

Or, as @mike.moloch said, you could try reducing the batch size until you have a model that your GPU can handle. The catch is that as you decrease the batch size, it is better to increase the number of epochs, because your model will have a harder time converging.

If you can’t figure out anything better than decreasing the batch size, you might want to use the gradient accumulation logic that Jeremy talks about in one of his videos: “Lesson 7: Practical Deep Learning for Coders 2022” (YouTube).

This allows you to train the model well even if your GPU can only handle a small batch size. Basically, the batch size you use to update your model doesn’t have to match the batch size that you physically run on the GPU.
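To make the idea concrete, here is a toy sketch in plain Python (no PyTorch or fastai) for a one-parameter model y = w * x with mean squared error. The function names and the data are made up for illustration. It shows that summing the gradients of small micro-batches, each weighted by its share of the full batch, gives the same gradient as processing the full batch at once. fastai ships a `GradientAccumulation` callback for the real thing, if I remember right.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_bs):
    """Same gradient, but computed micro_bs samples at a time and accumulated."""
    n = len(xs)
    total = 0.0
    for i in range(0, n, micro_bs):
        chunk_x, chunk_y = xs[i:i + micro_bs], ys[i:i + micro_bs]
        # Weight each micro-batch by its share of the full batch.
        total += grad_mse(w, chunk_x, chunk_y) * len(chunk_x) / n
    return total

xs = [float(i) for i in range(1, 9)]   # a "batch" of 8 samples
ys = [3.0 * x for x in xs]             # targets generated with w = 3
w = 0.5                                # current parameter value

full = grad_mse(w, xs, ys)                      # needs all 8 samples at once
acc = accumulated_grad(w, xs, ys, micro_bs=2)   # only 2 samples at a time

# One optimizer step would use the accumulated gradient exactly like the full one.
lr = 0.01
w_new = w - lr * acc
```

Only `micro_bs` samples have to fit in memory at any moment, while the update itself behaves like one made with the full batch of 8.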