NLP Classification - imdb out of memory


(Gerardo Garcia) #1

This is the third day I have tried to follow the notebook step by step, and I always run out of memory.
I have the latest code and I’m using an NVIDIA 1080 Ti card.

I have tried changing

bptt=30
bs=10

Is there any other variable that I need to modify?


(Nick) #2

bptt=30 and bs=30 was sufficient for me to complete training on a 1070.


(Thomas) #3

Do you have only one model? In the notebook there is both learner and learn, and my impression was that something stays around in memory there. If you are on PyTorch 0.4+, you might also look into using with torch.no_grad(): for evaluation.
In recent PyTorch you can check with torch.cuda.memory_allocated() how much of your GPU memory is actually in use (and not just cached). Ideally this should be about the memory for the weights. It’ll go up during the forward pass in training, but should be lower again after the backward pass.
I hope this is useful for you. I never dug all that deep into it; I just reloaded the kernel, loaded the model I had up to then, and continued.

Best regards

Thomas
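
For anyone who wants to try the two suggestions above, here is a minimal sketch. It assumes PyTorch 0.4 or newer; the tiny nn.Linear model and the random batch are hypothetical stand-ins, not the notebook's learner, and allocated_mib is just an illustrative helper name.

```python
import torch
import torch.nn as nn


def allocated_mib():
    """Memory currently held by live tensors on the GPU, in MiB.

    Returns 0.0 when no CUDA device is available.
    """
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated() / 1024 ** 2


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the real model and a real batch.
model = nn.Linear(128, 2).to(device)
batch = torch.randn(16, 128, device=device)

model.eval()
with torch.no_grad():
    # No autograd graph is recorded here, so intermediate activations
    # are not kept around for a backward pass that never happens.
    preds = model(batch)

print("requires_grad:", preds.requires_grad)
print("allocated: {:.1f} MiB".format(allocated_mib()))
```

Calling allocated_mib() before and after a training step should show whether memory returns to roughly the size of the weights or keeps growing.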


(Gerardo Garcia) #4

bptt=30 and bs=30 did not run to completion on the 1080 Ti.
Right now I’m trying:

bptt=30, bs=20


(Gerardo Garcia) #5

I changed those learn references to learner and the issue persists.
At this point I’m just trying to run the whole notebook from start to finish, to get an idea of what I need to do afterwards.

Any suggestions?

I’m running the nvidia-smi command to monitor the GPU’s status, temperature, and fan speed.
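
If it helps to log those readings alongside the notebook, the monitoring above can be sketched in Python. This assumes nvidia-smi is on the PATH and supports the --query-gpu fields used here; parse_gpu_line and query_gpus are hypothetical helper names.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "memory.used,memory.total,temperature.gpu,fan.speed"


def parse_gpu_line(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    used, total, temp, fan = [field.strip() for field in line.split(",")]

    def to_int(value):
        # Some cards report e.g. "[N/A]" for the fan; map that to None.
        return int(value) if value.isdigit() else None

    return {
        "memory_used_mib": to_int(used),
        "memory_total_mib": to_int(total),
        "temperature_c": to_int(temp),
        "fan_percent": to_int(fan),
    }


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_line(row) for row in result.stdout.splitlines() if row.strip()]
```

Calling query_gpus() between notebook cells gives a quick record of how close each step gets to the memory limit.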


(Nick) #6

How far did you get before it crashed? It’s been a while since I did it, but from what I remember of ULMFiT memory issues, I’d guess it was at the final .fit()?

I guess try reducing both even further at the step before the crash occurs.

(I’d note that it is possible the notebook has changed, but I was able to run the IMDB scripts last week with a similar setting)
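
As a rough guide to why lowering both values helps, here is a small sketch comparing the settings mentioned in this thread. It rests on the assumption that activation memory grows roughly with the number of tokens per batch (bs * bptt), while the memory for the weights stays fixed; tokens_per_batch is a hypothetical helper and these are nominal sizes, not measurements.

```python
def tokens_per_batch(bs, bptt):
    """Nominal number of tokens processed in one language-model batch."""
    return bs * bptt


# (bs, bptt) pairs mentioned in this thread.
settings = [(30, 30), (20, 30), (20, 20)]
baseline = tokens_per_batch(30, 30)

for bs, bptt in settings:
    tokens = tokens_per_batch(bs, bptt)
    ratio = tokens / baseline
    print("bs={} bptt={} -> {} tokens ({:.0%} of bs=30, bptt=30)".format(bs, bptt, tokens, ratio))
```

So going from bs=30, bptt=30 to bs=20, bptt=20 cuts the per-batch token count to less than half, which is only an estimate of the actual memory saving.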


(Gerardo Garcia) #7

I tried bptt=20, bs=20 and it seems to be working.
Memory usage is now 5 GB of the 8 GB available on the card.

Now, I’m getting an error with the tensor
Any ideas?
notebook.pdf (414.0 KB)


(Nick) #8

Wrong version of PyTorch maybe?


(Gerardo Garcia) #9

I have the latest
This is frustrating.
:rage:


(Gerardo Garcia) #10

I think I found the issue
The 1080 Ti has 11 GB of memory

AWS is using K80 - 24 GB
Nvidia - K80
or
AWS P3 using Tesla V100 - 32 GB
NVIDIA - Tesla V100


(Arnav) #11

As a last attempt, you could try installing version 0.3 of PyTorch. That is the version the course was built on.
It’s been a while since I ran it too, but afair it also ran on Colab, so you could try that.


(Gerardo Garcia) #12

Is this the latest version?

version = '0.3.1.post2'
debug = False
cuda = '9.0'