NB_10_nlp notebook may have a memory leak

RogerS49 · March 21, 2020, 10:03am

NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1 

 0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   25C    P8     8W / 250W |  10559MiB / 11178MiB |      0%      Default 

  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     18141      C   /home/dl/anaconda3/envs/fastai2/bin/python 10547MiB |

After closing NB 10, which cleared the GPU memory, I ran both NB 11 and NB 12:

0 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 23% 21C P8 8W / 250W | 501MiB / 11178MiB |
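
Closing the notebook frees the memory because the kernel process exits. A minimal sketch of releasing cached memory from inside a running kernel instead, assuming the large objects (for example learn) have already been removed with del. The helper name release_cuda_cache is hypothetical, and it returns None when PyTorch or CUDA is not available:

```python
import gc


def release_cuda_cache():
    """Collect garbage, then hand PyTorch's unused cached blocks back to the driver.

    Returns (allocated_bytes, reserved_bytes) for the current device,
    or None if PyTorch is not installed or no CUDA device is usable.
    """
    gc.collect()  # drop unreachable Python objects that may still own tensors
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()  # frees cached blocks that no live tensor uses
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


result = release_cuda_cache()
print(result)
```

Memory that live Python objects still reference is not released by this; only the blocks the caching allocator holds without any tensor using them go back to the driver.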

In NB 10, after modifying the first line of the SentencePieceTrainer.Train input by removing the {q}'s, I get:

RuntimeError: CUDA out of memory. Tried to allocate 2.29 GiB (GPU 0; 10.92 GiB total capacity; 7.50 GiB already allocated; 619.44 MiB free; 9.79 GiB reserved in total by PyTorch)

This occurs in the "Fine tuning the language model" section:

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()

learn.fit_one_cycle(1, 2e-2)

epoch train_loss valid_loss accuracy perplexity time
0 20413316.000000 00:01

RuntimeError Traceback (most recent call last)
in
----> 1 learn.fit_one_cycle(1, 2e-2)

~/fastai-2020/fastai2/fastai2/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
110 scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
111 'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 112 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
113
114 # Cell

~/fastai-2020/fastai2/fastai2/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
188 try:
189 self.epoch=epoch; self('begin_epoch')
--> 190 self._do_epoch_train()
191 self._do_epoch_validate()
192 except CancelEpochException: self('after_cancel_epoch')

~/fastai-2020/fastai2/fastai2/learner.py in _do_epoch_train(self)
161 try:
162 self.dl = self.dls.train; self('begin_train')
--> 163 self.all_batches()
164 except CancelTrainException: self('after_cancel_train')
165 finally: self('after_train')

~/fastai-2020/fastai2/fastai2/learner.py in all_batches(self)
139 def all_batches(self):
140 self.n_iter = len(self.dl)
--> 141 for o in enumerate(self.dl): self.one_batch(*o)
142
143 def one_batch(self, i, b):

~/fastai-2020/fastai2/fastai2/learner.py in one_batch(self, i, b)
149 self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')
150 if not self.training: return
--> 151 self.loss.backward(); self('after_backward')
152 self.opt.step(); self('after_step')
153 self.opt.zero_grad()

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
193 products. Defaults to False.
194 """
--> 195 torch.autograd.backward(self, gradient, retain_graph, create_graph)
196
197 def register_hook(self, hook):

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
97 Variable._execution_engine.run_backward(
98 tensors, grad_tensors, retain_graph, create_graph,
---> 99 allow_unreachable=True) # allow_unreachable flag
100
101

RuntimeError: CUDA out of memory. Tried to allocate 2.29 GiB (GPU 0; 10.92 GiB total capacity; 7.50 GiB already allocated; 619.44 MiB free; 9.79 GiB reserved in total by PyTorch)
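
The numbers in this message can be read directly: "reserved" is what PyTorch's caching allocator holds in total, and "allocated" is the part live tensors are using. A small sketch that parses the message and compares the two; parse_oom is a hypothetical helper and assumes the PyTorch 1.4 message format shown above:

```python
import re

# Convert the units used in the message to GiB.
_UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}
_SIZE = r"([\d.]+) (MiB|GiB)"

MSG = ("RuntimeError: CUDA out of memory. Tried to allocate 2.29 GiB "
       "(GPU 0; 10.92 GiB total capacity; 7.50 GiB already allocated; "
       "619.44 MiB free; 9.79 GiB reserved in total by PyTorch)")


def parse_oom(msg):
    """Return the five sizes in the OOM message as a dict of GiB values."""
    pattern = (
        rf"Tried to allocate {_SIZE} .*?{_SIZE} total capacity; "
        rf"{_SIZE} already allocated; {_SIZE} free; {_SIZE} reserved"
    )
    m = re.search(pattern, msg)
    if m is None:
        raise ValueError("message does not match the expected format")
    keys = ["requested", "total", "allocated", "free", "reserved"]
    vals = m.groups()  # alternating (number, unit) pairs
    return {k: float(vals[2 * i]) * _UNIT_TO_GIB[vals[2 * i + 1]]
            for i, k in enumerate(keys)}


info = parse_oom(MSG)
cached_unused = info["reserved"] - info["allocated"]
print(f"requested {info['requested']:.2f} GiB, "
      f"cached but unused {cached_unused:.2f} GiB, "
      f"free outside PyTorch {info['free']:.2f} GiB")
```

Here PyTorch itself has reserved 9.79 of the 10.92 GiB, so the card is being filled by this one training process rather than by something else. That does not by itself say which step used the memory.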

Could sentencepiece be the cause?

RogerS49 · March 25, 2020, 7:41am

This post should be ignored. The memory leak was due to something in and around the to_fp16() call, and perhaps to other issues being worked on in the background as the fastai2 team works to complete the release.

Also, I should have read the fastbook chapter more closely.

I was able to move forward by removing that part of the call (.to_fp16()), which I understood is not intended for GPU use.

As this post may confuse people, it should be deleted, but I don’t have the permissions.

RogerS49 · March 26, 2020, 8:50am

I now have an issue where my id:0 GPU is not being utilised. I have 2 GPUs: id:1 is used for the screen but has GPU capabilities and 2GB of memory.
GPU id:0 has in the region of 11GB but has not been utilised since I removed the .to_fp16().

I update my environment almost daily: I git pull fastcore, fastai2, and course-v4, then pip uninstall and reinstall the libraries using pip install -e ".[dev]".

As can be seen in the printout above, GPU id:0 was utilised before. The notebook still works, but it now uses the CPU only.

My PyTorch build is:

version 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0
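
One way to control which card PyTorch sees is to pin the process to the large GPU before CUDA is initialised. A sketch, assuming the CSV text below stands in for the output of nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader,nounits. The GPU names and the 2048 MiB figure are placeholders, and pick_largest_gpu is a hypothetical helper:

```python
import os


def pick_largest_gpu(csv_text):
    """Return the index of the GPU with the most total memory (MiB).

    Each line is expected to look like: "index, name, memory_total".
    """
    best_index, best_mem = None, -1
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        index, mem = int(parts[0]), int(parts[-1])  # first and last fields
        if mem > best_mem:
            best_index, best_mem = index, mem
    return best_index


# Placeholder listing: 11178 MiB is from the nvidia-smi output in this thread,
# the names and the 2048 MiB value are assumptions for illustration.
sample = """0, large GPU (placeholder name), 11178
1, display GPU (placeholder name), 2048"""

chosen = pick_largest_gpu(sample)

# Both variables must be set before CUDA is initialised; the simplest way
# is to set them before importing torch.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # make CUDA indices match nvidia-smi
os.environ["CUDA_VISIBLE_DEVICES"] = str(chosen)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

After that, torch.cuda.is_available() and next(learn.model.parameters()).device show whether CUDA is usable and whether the model actually sits on the GPU.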

RogerS49 · March 27, 2020, 8:47am

Here is my solution and many thanks to the original contributor to this post.

Post Regarding Local Environment Installs with GPUs

jquintanilla4 · October 15, 2021, 6:32am

How did you solve it, code-wise? I’ve just run into this problem, and I’m running my Jupyter notebook on Paperspace Gradient, so I know it’s not a local GPU issue. I’ve read multiple posts on how to fix this issue and have not been able to move forward; nothing works.

Any help would be appreciated.

jquintanilla4 · October 18, 2021, 6:17am

For anyone who runs into this problem: the way I solved it was to halve the bs (batch size) in dls_lm. It worked without issue after that. The accuracy only dropped by a point. Not the best outcome, but it allowed me to continue without getting a runtime error.
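
The halving can also be done in a loop rather than by hand. A minimal sketch, assuming a hypothetical train_fn(bs) that rebuilds the DataLoaders and learner for the given batch size and trains; fake_train below only simulates a card that runs out of memory above bs=64:

```python
def fit_with_backoff(train_fn, bs, min_bs=1):
    """Call train_fn(bs), halving bs whenever it raises an out-of-memory error.

    Returns (bs_that_worked, result). Any other RuntimeError is re-raised.
    """
    while bs >= min_bs:
        try:
            return bs, train_fn(bs)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # not an OOM error, so do not hide it
            bs //= 2
    raise RuntimeError("out of memory even at the minimum batch size")


def fake_train(bs):
    # Stand-in for real training: pretend anything above 64 does not fit.
    if bs > 64:
        raise RuntimeError("CUDA out of memory. (simulated)")
    return "trained"


used_bs, outcome = fit_with_backoff(fake_train, bs=128)
print(used_bs, outcome)
```

In the notebook, train_fn would have to recreate dls_lm with the new bs and build a fresh learner each time, since the batch size is fixed when the DataLoaders are created.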

Update: I’m not sure if this is related to the smaller batch size or to using Paperspace, but later in the chapter (Fine-Tuning the Classifier), when I load ‘finetune’ and try to fine-tune it via:

learn.fit_one_cycle(1, 2e-2)

I get the warning:

/opt/conda/envs/fastai/lib/python3.8/site-packages/fastprogress/fastprogress.py:74: UserWarning: Your generator is empty.
  warn("Your generator is empty.")

I’ve tried everything to solve both issues: googled, poked around the code. No luck. So if I make my initial batch size too big and try to redo the 10-epoch training, I run into the “CUDA out of memory” problem. But if I halve the batch size, then later on, when I’m fine-tuning the classifier, I run into “Your generator is empty”. Or maybe it’s just another problem altogether.
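
One thing that can be ruled out quickly is a loader with zero batches, since the warning appears when the progress bar is handed something of length 0. A small diagnostic sketch; empty_loaders and n_batches are hypothetical helpers, and whether this is the actual cause here is not confirmed:

```python
def empty_loaders(loaders):
    """Return the names of loaders whose length (number of batches) is zero."""
    return [name for name, dl in loaders.items() if len(dl) == 0]


def n_batches(n_items, bs, drop_last):
    """Batches a plain loader yields; with drop_last the partial batch is discarded."""
    return n_items // bs if drop_last else -(-n_items // bs)  # floor vs. ceiling


# Plain lists stand in for real loaders so the sketch runs anywhere.
report = empty_loaders({"train": [], "valid": ["batch0", "batch1"]})
print(report)
print(n_batches(100, 128, drop_last=True), n_batches(100, 128, drop_last=False))
```

In the notebook the call would be along the lines of empty_loaders({"train": dls_clas.train, "valid": dls_clas.valid}), where dls_clas is assumed to be the classifier DataLoaders. An empty train loader would point at the data (wrong path, too few items for the batch size) rather than at the model.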

Anyone got ideas?

victorbahlangene · July 15, 2022, 1:57am

Hi. I had the same error. Have you figured it out?