How to prevent CUDA out of memory errors in Lesson 4?

neves · March 20, 2018, 1:12am

After a lot of trouble getting the Lesson 4 Sentiment Notebook (lesson4-imdb.ipynb) to run on my Windows 10 machine, I’m now stuck on a CUDA out of memory error.

When it starts to train the Sentiment classifier, in this cell:

m3.freeze_to(-1)
m3.fit(lrs/2, 1, metrics=[accuracy])
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)

it always fails with the message (complete stack trace at the end of the message):

RuntimeError: cuda runtime error (2) : out of memory at c:\anaconda2\conda-bld\pytorch_1519501749874\work\torch\lib\thc\generic/THCStorage.cu:58

I have an Nvidia GTX 1060 with 6 GB of memory. Usually reducing the batch size lets me run the models, but this time it wasn’t enough. I changed from the commented-out values to the ones below, but I still get the error:

#bs=64; bptt=70
bs=32;bptt=30

I’m also keeping these parameters as they are:

em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

If I change these parameters, the learner.fit call in the Train section also fails with a CUDA out of memory error. It looks like a lot of data is being cached in the notebook. If I change them and try to run just the Sentiment section, I also get an out of memory error.
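For anyone who suspects the same caching problem, here is a minimal sketch of dropping old models from the notebook before re-running a section. The helper name release_objects is made up for illustration, and the torch.cuda.empty_cache() step assumes a PyTorch build that provides it; the helper skips that step if PyTorch is missing.

```python
import gc


def release_objects(namespace, names):
    """Delete the named objects from a namespace dict, then collect garbage.

    Returns the list of names that were actually removed.
    (Hypothetical helper, not part of fastai or PyTorch.)
    """
    removed = []
    for name in names:
        if name in namespace:
            del namespace[name]
            removed.append(name)
    gc.collect()
    try:
        import torch  # optional: only needed for the CUDA cache step
        if torch.cuda.is_available():
            # Hands cached, unused blocks back to the driver. Memory that is
            # still referenced by live tensors is not freed.
            torch.cuda.empty_cache()
    except (ImportError, AttributeError):
        pass
    return removed
```

In a notebook cell this would be called as release_objects(globals(), ["m3", "learner"]). Jupyter can also hold references through its output history, so restarting the kernel is probably the more reliable way to get a clean slate.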

Could someone please give me some guidance on how to change my parameters so that I can run it on my 6 GB GPU?

Oops, I’ve just lost my stack trace. I’ll run it again and post it here tomorrow.

Chris_Palmer · April 11, 2018, 8:45pm

Hi @neves - I am interested in whether you worked through this. I have an even worse GPU than the 1060 (a 650 Ti with 2 GB) and was wondering whether I could “upgrade” my limited-capacity system to a 1060 or 1070. So finding out that the NLP lesson runs out of memory on a 6 GB card is interesting to me.

castilla · May 25, 2018, 10:29am

@neves I am having the same problem with an 8 GB 1070 Ti. The memory load stays around 7.65 GB during the process and then the error happens. It looks like the standard parameters in the lesson require more memory.

dangraf · June 4, 2018, 6:21pm

I’m experiencing the same memory problem. When watching nvidia-smi, the memory usage is around 7.65 GB for me too. I lowered the batch size from bs=64 to bs=16 and still get the same problem.
I have also run the command “watch -n 5 free -m” to find out whether it is the RAM or the GPU memory that fails, but both seem stable for a while and then suddenly, after 3-4 minutes, it fails.
The graphics card is an Nvidia Quadro P4000 with 8 GB of GPU memory, and the machine has 64 GB of RAM.
Any suggestions?
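For logging GPU memory from Python instead of watching nvidia-smi by hand, something like the sketch below might help. It assumes nvidia-smi is on the PATH and uses its CSV query output; the function names parse_gpu_memory and query_gpu_memory are made up for illustration.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one GPU per line) into a list of tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Ask nvidia-smi for used/total memory per GPU.

    Returns an empty list if nvidia-smi is missing or its output is unusable.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            universal_newlines=True,
        )
        return parse_gpu_memory(out)
    except (subprocess.CalledProcessError, OSError, ValueError):
        return []


for used, total in query_gpu_memory():
    print("GPU memory: {} / {} MiB".format(used, total))
```

Calling query_gpu_memory() every few seconds during training and printing the result should show whether usage creeps up slowly or jumps right before the failure.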

alecrubin · June 5, 2018, 12:05am

Try lowering your max sequence length.
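A rough, framework-agnostic sketch of what that means, assuming the reviews are already lists of token ids. The helpers truncate_sequences and padded_batch_elements are hypothetical and not fastai functions; in the notebook itself you would lower whatever maximum length you pass when building the model.

```python
def truncate_sequences(sequences, max_len):
    """Return copies of the sequences, each cut to at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [seq[:max_len] for seq in sequences]


def padded_batch_elements(sequences):
    """Number of token slots in a batch padded to its longest sequence."""
    if not sequences:
        return 0
    return len(sequences) * max(len(seq) for seq in sequences)


# Toy token-id sequences standing in for tokenized reviews (made-up data).
reviews = [list(range(2000)), list(range(300)), list(range(50))]
short = truncate_sequences(reviews, max_len=1000)
print(padded_batch_elements(reviews), padded_batch_elements(short))
```

The point is that a single very long review sets the padded size of its whole batch, so capping the length limits the worst case even when the batch size stays the same.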

dangraf · June 5, 2018, 7:54am

Thanks! This is interesting.
When I change the batch size from 64 down to 16, the error persists.
But if I keep the batch size at 64 and just lower bptt from 70 to 65, it works just fine!

martijnd · June 22, 2018, 2:35pm

Thanks a lot. I had the same issue in lesson 10 (running out of memory); lowering bptt to 60 works fine on my GTX 1070.

rudym · July 13, 2018, 5:03am

Just ran this lesson, and my server with a Tesla P100 shows memory usage of 10585MiB, which is 94% usage on that single card. Each epoch is taking about 20 minutes, so about 5 hours to run.
The python3 process is using about 100 GB of RAM on my server.

I’d be more interested in lowering the RAM usage of the Python process, which causes my desktop to error out. The GPU in that system is a Titan V with 12 GB of memory. That machine only has 64 GB of RAM, which is the cause of my errors in Jupyter.