CUDA out of memory with single-image dataset

I am trying to adapt the code from lesson 3 to my own dataset, and get a ‘CUDA out of memory’ error when trying to run lr_find()

Similar issues are discussed here: "CUDA out of memory" and here: "CUDA Out of memory (GPU) issue while lr_find", but the suggestions there didn't help me.

I have reduced batch_size to 1, and also reduced my data so that the databunch consists of only a single image. Furthermore, I have switched to a smaller model: resnet18 instead of resnet34.

Thus, the problem seems to lie somewhere else. Does anybody have an idea?
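For reference, this is roughly what my setup looks like (path_img, get_y_fn and codes are placeholders from my notebook, following the lesson 3 camvid pattern):

```python
from fastai.vision import *

# data block setup, reduced as described above
src = (SegmentationItemList.from_folder(path_img)
       .split_by_rand_pct(0.2)
       .label_from_func(get_y_fn, classes=codes))

data = (src.transform(get_transforms(), tfm_y=True)
        .databunch(bs=1)                  # batch size reduced to 1
        .normalize(imagenet_stats))

learn = unet_learner(data, models.resnet18)   # resnet18 instead of resnet34
lr_find(learn)
```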

Error log:

RuntimeError                              Traceback (most recent call last)
~/anaconda3/envs/myenv/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
    100 xb, yb = cb_handler.on_batch_begin(xb, yb)
--> 101 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
    102 if cb_handler.on_batch_end(loss): break

~/anaconda3/envs/myenv/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     25 if not is_listy(yb): yb = [yb]
---> 26 out = model(*xb)
     27 out = cb_handler.on_loss_begin(out)

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    540 else:
--> 541     result = self.forward(*input, **kwargs)
    542 for hook in self._forward_hooks.values():

~/anaconda3/envs/myenv/lib/python3.7/site-packages/fastai/layers.py in forward(self, x)
    135 res.orig = x
--> 136 nres = l(res)
    137 # We have to remove res.orig to avoid hanging refs and therefore memory leaks

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    540 else:
--> 541     result = self.forward(*input, **kwargs)
    542 for hook in self._forward_hooks.values():

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     91 for module in self._modules.values():
---> 92     input = module(input)
     93 return input

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    540 else:
--> 541     result = self.forward(*input, **kwargs)
    542 for hook in self._forward_hooks.values():

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     91 for module in self._modules.values():
---> 92     input = module(input)
     93 return input

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    540 else:
--> 541     result = self.forward(*input, **kwargs)
    542 for hook in self._forward_hooks.values():

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torchvision/models/resnet.py in forward(self, x)
     58
---> 59 out = self.conv1(x)
     60 out = self.bn1(out)

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    540 else:
--> 541     result = self.forward(*input, **kwargs)
    542 for hook in self._forward_hooks.values():

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
    344 def forward(self, input):
--> 345     return self.conv2d_forward(input, self.weight)
    346

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/modules/conv.py in conv2d_forward(self, input, weight)
    341 return F.conv2d(input, weight, self.bias, self.stride,
--> 342                 self.padding, self.dilation, self.groups)
    343

RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 10.76 GiB total capacity; 9.75 GiB already allocated; 10.44 MiB free; 89.12 MiB cached)

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
in
----> 1 lr_find(learn)

~/anaconda3/envs/myenv/lib/python3.7/site-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, wd)
     39 cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
     40 epochs = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 41 learn.fit(epochs, start_lr, callbacks=[cb], wd=wd)
     42
     43 def to_fp16(learn:Learner, loss_scale:float=None, max_noskip:int=1000, dynamic:bool=True, clip:float=None,

~/anaconda3/envs/myenv/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    198 else: self.opt.lr,self.opt.wd = lr,wd
    199 callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
--> 200 fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    201
    202 def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/myenv/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
    110 exception = e
    111 raise
--> 112 finally: cb_handler.on_train_end(exception)
    113
    114 loss_func_name2activ = {'cross_entropy_loss': F.softmax, 'nll_loss': torch.exp, 'poisson_nll_loss': torch.exp,

~/anaconda3/envs/myenv/lib/python3.7/site-packages/fastai/callback.py in on_train_end(self, exception)
    321 def on_train_end(self, exception:Union[bool,Exception])->None:
    322     "Handle end of training, exception is an Exception or False if no exceptions during training."
--> 323     self('train_end', exception=exception)
    324
    325 @property

~/anaconda3/envs/myenv/lib/python3.7/site-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    249 if call_mets:
    250     for met in self.metrics: self._call_and_update(met, cb_name, **kwargs)
--> 251 for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
    252
    253 def set_dl(self, dl:DataLoader):

~/anaconda3/envs/myenv/lib/python3.7/site-packages/fastai/callback.py in _call_and_update(self, cb, cb_name, **kwargs)
    239 def _call_and_update(self, cb, cb_name, **kwargs)->None:
    240     "Call cb_name on cb and update the inner state."
--> 241     new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
    242     for k,v in new.items():
    243         if k not in self.state_dict:

~/anaconda3/envs/myenv/lib/python3.7/site-packages/fastai/callbacks/lr_finder.py in on_train_end(self, **kwargs)
     33 def on_train_end(self, **kwargs:Any)->None:
     34     "Cleanup learn model weights disturbed during LRFinder exploration."
---> 35     self.learn.load('tmp', purge=False)
     36     if hasattr(self.learn.model, 'reset'): self.learn.model.reset()
     37     for cb in self.callbacks:

~/anaconda3/envs/myenv/lib/python3.7/site-packages/fastai/basic_train.py in load(self, file, device, strict, with_opt, purge, remove_module)
    265 elif isinstance(device, int): device = torch.device('cuda', device)
    266 source = self.path/self.model_dir/f'{file}.pth' if is_pathlike(file) else file
--> 267 state = torch.load(source, map_location=device)
    268 if set(state.keys()) == {'model', 'opt'}:
    269     model_state = state['model']

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    424 if sys.version_info >= (3, 0) and 'encoding' not in pickle_load_args.keys():
    425     pickle_load_args['encoding'] = 'utf-8'
--> 426 return _load(f, map_location, pickle_module, **pickle_load_args)
    427 finally:
    428     if new_fd:

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module, **pickle_load_args)
    611 unpickler = pickle_module.Unpickler(f, **pickle_load_args)
    612 unpickler.persistent_load = persistent_load
--> 613 result = unpickler.load()
    614
    615 deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/serialization.py in persistent_load(saved_id)
    574 obj = data_type(size)
    575 obj._torch_load_uninitialized = True
--> 576 deserialized_objects[root_key] = restore_location(obj, location)
    577 storage = deserialized_objects[root_key]
    578 if view_metadata is not None:

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/serialization.py in restore_location(storage, location)
    444 elif isinstance(map_location, torch.device):
    445     def restore_location(storage, location):
--> 446         return default_restore_location(storage, str(map_location))
    447 else:
    448     def restore_location(storage, location):

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/serialization.py in default_restore_location(storage, location)
    153 def default_restore_location(storage, location):
    154     for _, _, fn in _package_registry:
--> 155         result = fn(storage, location)
    156         if result is not None:
    157             return result

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/serialization.py in _cuda_deserialize(obj, location)
    133 storage_type = getattr(torch.cuda, type(obj).__name__)
    134 with torch.cuda.device(device):
--> 135     return storage_type(obj.size())
    136 else:
    137     return obj.cuda(device)

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 9.78 GiB already allocated; 6.44 MiB free; 66.96 MiB cached)

For some reason, I was missing the 'size' parameter when creating my databunch, even though I had calculated it earlier in the notebook.
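In code, the fix is just to pass the size computed earlier into the transform call, roughly like this (src and size are from the sketch above / my notebook):

```python
# pass the precomputed size so images and masks are resized on load,
# instead of being fed to the model at full resolution
data = (src.transform(get_transforms(), size=size, tfm_y=True)
        .databunch(bs=1)
        .normalize(imagenet_stats))
```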

Now, after downsizing my images appropriately, I get yet another error:
CUDA error: device-side assert triggered

Update:
I was able to work around the second error as well, by changing the target labels.

My masks were not originally .pngs as in the course; instead they came directly from a dataframe as run-length encodings. I converted them to numpy arrays of zeros and ones and stored these as .png files (a sketch of the conversion is below).
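The conversion itself was a standard run-length decoder, roughly like the following sketch. The dataframe column, the image shape, and the 1-indexed, column-major convention are assumptions that depend on the dataset; the save step scales the mask to 0/255 so it is visible when viewed, which is also one way to end up with the value problem described next.

```python
import numpy as np
from PIL import Image

def rle_decode(rle: str, shape) -> np.ndarray:
    "Decode a 'start length start length ...' string into a 0/1 mask."
    s = list(map(int, rle.split()))
    starts, lengths = s[0::2], s[1::2]
    mask = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for start, length in zip(starts, lengths):
        mask[start - 1:start - 1 + length] = 1      # assumes 1-indexed starts
    return mask.reshape(shape, order='F')           # assumes column-major pixel order

# hypothetical example: decode one row of the dataframe and save it as a mask .png
arr = rle_decode(df.loc[0, 'EncodedPixels'], (768, 768))
Image.fromarray(arr * 255).save('masks/mask_0001.png')   # scaled to 0/255 so the mask is visible
```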
When loading the masks with open_mask(), they looked just fine when visualized, but by looking at .data I realized that the loaded masks contained the values 0 and 255 instead of 0 and 1. I guessed that the masks loaded via .label_from_func() would have these values too, so I added 300 dummy labels for the mask (in addition to the background label), so that 255 becomes a valid class index. Now it works, but this is a very unseemly workaround and I intend to find a proper solution.
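Concretely, the check that gave it away was just this (the mask path/filename is a placeholder; the printed values are what I saw in my case):

```python
from fastai.vision import *

mask = open_mask(path_lbl/'mask_0001.png')   # path and filename are placeholders
mask.show()                                   # looks fine when displayed
print(mask.data.unique())                     # tensor([0, 255]) instead of the expected tensor([0, 1])
```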

Passing div=True to open_mask() will load the mask as 0 and 1 instead of 0 and 255.
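To get .label_from_func() to apply div=True as well, a commonly used sketch (assuming fastai v1 and the usual data block setup; get_y_fn and the class names are placeholders) is to subclass the segmentation label list:

```python
from fastai.vision import *

class SegLabelListCustom(SegmentationLabelList):
    # open masks with div=True so 0/255 pngs become 0/1 targets
    def open(self, fn): return open_mask(fn, div=True)

class SegItemListCustom(SegmentationItemList):
    _label_cls = SegLabelListCustom

# build the source from the custom item list instead of SegmentationItemList
src = (SegItemListCustom.from_folder(path_img)
       .split_by_rand_pct(0.2)
       .label_from_func(get_y_fn, classes=['background', 'target']))
```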