CUDA memory error

I’m training a model and I’m getting a runtime error that says CUDA is out of memory. However, the error log clearly shows more free space than the allocation needs. Can someone tell me what’s happening?

I have a 4 GB GTX 1050 Ti.
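In case it helps, this is roughly how I’m reading what PyTorch itself sees on the card (a minimal sketch; it assumes the 1050 Ti is CUDA device 0, and note that memory_cached() was renamed memory_reserved() in later PyTorch releases):

import torch

device = torch.device("cuda:0")  # assuming the 1050 Ti is device 0
props = torch.cuda.get_device_properties(device)

print(f"device: {props.name}")
print(f"total memory: {props.total_memory / 1024**3:.2f} GiB")
# Memory held by live tensors created in this process:
print(f"allocated: {torch.cuda.memory_allocated(device) / 1024**2:.2f} MiB")
# Memory reserved by PyTorch's caching allocator (a superset of allocated):
print(f"cached: {torch.cuda.memory_cached(device) / 1024**2:.2f} MiB")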

Error log:

RuntimeError Traceback (most recent call last)
in
----> 1 learn.fit_one_cycle(2)

D:\Program_files\lib\site-packages\fastai\train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
20 callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
21 final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 22 learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
23
24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, wd:float=None):

D:\Program_files\lib\site-packages\fastai\basic_train.py in fit(self, epochs, lr, wd, callbacks)
198 callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
199 if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
---> 200 fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
201
202 def create_opt(self, lr:Floats, wd:Floats=0.)->None:

D:\Program_files\lib\site-packages\fastai\basic_train.py in fit(epochs, learn, callbacks, metrics)
99 for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
100 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 101 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
102 if cb_handler.on_batch_end(loss): break
103

D:\Program_files\lib\site-packages\fastai\basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
24 if not is_listy(xb): xb = [xb]
25 if not is_listy(yb): yb = [yb]
---> 26 out = model(*xb)
27 out = cb_handler.on_loss_begin(out)
28

D:\Program_files\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
---> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)

D:\Program_files\lib\site-packages\torch\nn\modules\container.py in forward(self, input)
90 def forward(self, input):
91 for module in self._modules.values():
---> 92 input = module(input)
93 return input
94

D:\Program_files\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
---> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)

D:\Program_files\lib\site-packages\torch\nn\modules\container.py in forward(self, input)
90 def forward(self, input):
91 for module in self._modules.values():
---> 92 input = module(input)
93 return input
94

D:\Program_files\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
---> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)

D:\Program_files\lib\site-packages\torch\nn\modules\container.py in forward(self, input)
90 def forward(self, input):
91 for module in self._modules.values():
---> 92 input = module(input)
93 return input
94

D:\Program_files\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
---> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)

D:\Program_files\lib\site-packages\torchvision\models\resnet.py in forward(self, x)
53 identity = x
54
---> 55 out = self.conv1(x)
56 out = self.bn1(out)
57 out = self.relu(out)

D:\Program_files\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
---> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)

D:\Program_files\lib\site-packages\torch\nn\modules\conv.py in forward(self, input)
336 _pair(0), self.dilation, self.groups)
337 return F.conv2d(input, self.weight, self.bias, self.stride,
---> 338 self.padding, self.dilation, self.groups)
339
340

RuntimeError: CUDA out of memory. Tried to allocate 570.00 MiB (GPU 0; 4.00 GiB total capacity; 164.56 MiB already allocated; 2.78 GiB free; 13.44 MiB cached)

Did you ever figure this out? I am encountering the same problem.

I have found nvtop (an htop-style, interactive alternative to nvidia-smi) to be a great way to watch how CUDA memory is allocated in real time and to see how much CUDA memory your program is actually using. A 4 GB card is quite limiting: many models need more than 6 GB even with small batch sizes.
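If you’d rather watch the same numbers from inside Python, torch.cuda exposes the allocator’s counters. Here is a minimal sketch (memory_cached() was renamed memory_reserved() in later PyTorch releases):

import torch

def report_gpu_memory(device=0, tag=""):
    # Memory occupied by live tensors in this process.
    allocated = torch.cuda.memory_allocated(device) / 1024**2
    # Memory reserved by the caching allocator (a superset of allocated).
    cached = torch.cuda.memory_cached(device) / 1024**2
    print(f"{tag} allocated: {allocated:.1f} MiB | cached: {cached:.1f} MiB")

report_gpu_memory(tag="before batch")
# ... run a forward/backward pass here ...
report_gpu_memory(tag="after batch")

Keep in mind that nvidia-smi and nvtop report what the driver has handed to each process (the CUDA context plus the allocator’s cache, plus whatever other processes and the display are using), so their numbers will typically be higher than memory_allocated().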