Lesson 10 classification part running out of memory

This is my hardware:
Paperspace P4000
RAM: 30 GB
CPUs: 8
HD: 210.5 KB / 100 GB
GPU: 8 GB

So, everything is fine-ish.

The language model takes about 45 minutes per cycle, so it took a night to train.

The classification part is less fortunate. Every time I train the full model, I get:

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1512387374934/work/torch/lib/THC/generic/THCStorage.cu:58

When I unfreeze all the layers (learn.unfreeze()), the learner fails to fit because it runs out of memory.

The learner does not run out of memory if I keep some of the layers frozen (e.g., learn.freeze_to(4) trains fine with bs=48).

I have tried batch sizes from 48 down to 4 and still run out of memory.
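
For reference, here's roughly the sequence I'm running (fastai 0.7, lesson 10 imdb notebook; lrs, wd, and the use_clr values are from my copy of the notebook, so treat this as a sketch rather than the exact cells):

```python
# bs is the batch size used to build the classifier DataLoaders earlier in the notebook.
bs = 48  # also tried 32, 16, 8, and 4

# With the first layer groups frozen, training fits in the 8 GB of GPU memory:
learn.freeze_to(4)   # layer groups before index 4 stay frozen; only the rest train
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8, 3))

# Unfreezing everything keeps gradients (and more activations) for every layer group
# on the GPU, and this is where the CUDA out-of-memory error appears:
learn.unfreeze()
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(32, 10))
```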

Additionally, I got the error below as well (with bs=4); not sure if it's relevant.

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/tqdm/_tqdm.py", line 144, in run
    for instance in self.tqdm_cls._instances:
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration

Anybody else having the same or similar trouble?
What are your setups?


Yes, I've had similar problems. I'm on the same Paperspace instance as you and getting similar times per cycle. I haven't actually been able to complete the language model: I haven't hit the memory errors, but the training time is just way too long. I've tried letting it run through the night, but it hung with no message as to what went wrong.

For now, I've chosen to move on to other topics. I've left it at re-creating the notebooks for learning purposes, without actually seeing the model complete the full cycles of fit. I haven't yet tried adjusting the batch size or any other parameters.

I know that’s not particularly helpful, but I just wanted you to know that you’re not alone with these problems.


When creating the classifier there's a param that's set to 20*70 in the notebook. Change it to 10*70 to approximately halve the memory requirements.
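
If I remember the notebook correctly, it's the second positional argument in the cell that builds the classifier - something like this, with the other arguments left as they already are in your notebook:

```python
# The second positional argument (20*70 in the notebook) is the max sequence length
# the classifier keeps per document; dropping it to 10*70 roughly halves the memory use.
m = get_rnn_classifer(bptt, 10*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
                      layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
                      dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])
```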


My own box was similar to yours, and I had to set bs=10 to avoid the CUDA memory issue. I didn't change the param to 10*70.

Setting bs that low probably reduces performance more than changing the max sl param I suggested - and certainly makes it much slower to train. Might be worth trying - if you do, tell us what you find!


Just so that you know, it works.

Thanks for saving the day again!


I was able to run with bs=16 using 20*70 on my Linux box with 8 GB of CUDA RAM and 64 GB of CPU RAM.

Haha, I have 30 GB of RAM.

I'm upgrading to a P5000 on Paperspace soon, so I hope it won't be an issue after that.

Not sure if it's related, but I set bs=10 while training the language model on my local 4 GB GPU and it was showing more than 3 hours per epoch. So really slow, as you mentioned.