Kernel crashes while running the Lesson 3 camvid notebook

Running the cell

learn = Learner.create_unet(data, models.resnet34, metrics=metrics) 

from lesson3-camvid.ipynb crashes the kernel. To make sure it is not my setup, I tested it on a new AWS instance using the Deep Learning AMI (Ubuntu) Version 18.0 (ami-0688c8f24f1c0e235), but it still crashes.
I followed the steps described in the fastai GitHub repo to install the nightly torch build and installed the latest fastai from git.

The installed versions I have are:

~$ python -c "import torch; print(torch.__version__)"
1.0.0.dev20181116
~$ python -c "import fastai; print(fastai.__version__)"
1.0.24

Trying to debug the problem, I narrowed it down to the function model_sizes in fastai/callbacks/hooks.py, when running the line x = m.eval()(x).
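To check whether the crash is reproducible outside fastai, here is a rough sketch of the same kind of forward pass model_sizes performs: a plain torchvision resnet34 body run in eval mode on a dummy batch (the 128x128 input size is arbitrary). If this also kills the kernel, the problem is below fastai, in torch/torchvision:

import torch
import torchvision

# keep only the convolutional body, roughly what fastai's create_body does for a resnet
body = torch.nn.Sequential(*list(torchvision.models.resnet34(pretrained=False).children())[:-2]).cuda()
x = torch.randn(1, 3, 128, 128).cuda()   # dummy batch
with torch.no_grad():
    out = body.eval()(x)                 # the same m.eval()(x) style call that crashes
print(out.shape)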

My strong suspicion is that it is caused by some native library in PyTorch.

Has anyone faced this kind of crash, or does anyone have a suggestion? I have already wasted a day on this.


@aymenim I am facing a similar issue; the error I am receiving is:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)

~/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break

~/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     24     if opt is not None:
---> 25         loss = cb_handler.on_backward_begin(loss)
     26         loss.backward()

~/anaconda3/lib/python3.6/site-packages/fastai/callback.py in on_backward_begin(self, loss)
    220         "Handle gradient calculation on `loss`."
--> 221         self.smoothener.add_value(loss.detach().cpu())
    222         self.state_dict['last_loss'], self.state_dict['smooth_loss'] = loss, self.smoothener.smooth

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch-nightly_1539602533843/work/aten/src/THC/generic/THCTensorCopy.cpp:75

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-10-dd390b1c8108> in <module>
----> 1 lr_find(learn)
      2 learn.recorder.plot()

~/anaconda3/lib/python3.6/site-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, **kwargs)
     26     cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
     27     a = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 28     learn.fit(a, start_lr, callbacks=[cb], **kwargs)
     29 
     30 def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:

~/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163 
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     93         exception = e
     94         raise e
---> 95     finally: cb_handler.on_train_end(exception)
     96 
     97 loss_func_name2activ = {'cross_entropy_loss': partial(F.softmax, dim=1), 'nll_loss': torch.exp, 'poisson_nll_loss': torch.exp,

~/anaconda3/lib/python3.6/site-packages/fastai/callback.py in on_train_end(self, exception)
    254     def on_train_end(self, exception:Union[bool,Exception])->None:
    255         "Handle end of training, `exception` is an `Exception` or False if no exceptions during training."
--> 256         self('train_end', exception=exception)
    257 
    258 class AverageMetric(Callback):

~/anaconda3/lib/python3.6/site-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    185         "Call through to all of the `CallbakHandler` functions."
    186         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 187         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    188 
    189     def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:

~/anaconda3/lib/python3.6/site-packages/fastai/callback.py in <listcomp>(.0)
    185         "Call through to all of the `CallbakHandler` functions."
    186         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 187         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    188 
    189     def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:

~/anaconda3/lib/python3.6/site-packages/fastai/callbacks/lr_finder.py in on_train_end(self, **kwargs)
     45         # restore the valid_dl we turned off on `__init__`
     46         self.data.valid_dl = self.valid_dl
---> 47         self.learn.load('tmp')
     48         if hasattr(self.learn.model, 'reset'): self.learn.model.reset()
     49         print('LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.')

~/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in load(self, name, device)
    202         "Load model `name` from `self.model_dir` using `device`, defaulting to `self.data.device`."
    203         if device is None: device = self.data.device
--> 204         self.model.load_state_dict(torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device))
    205         return self
    206 

~/anaconda3/lib/python3.6/site-packages/torch/serialization.py in load(f, map_location, pickle_module)
    356         f = open(f, 'rb')
    357     try:
--> 358         return _load(f, map_location, pickle_module)
    359     finally:
    360         if new_fd:

~/anaconda3/lib/python3.6/site-packages/torch/serialization.py in _load(f, map_location, pickle_module)
    527     unpickler = pickle_module.Unpickler(f)
    528     unpickler.persistent_load = persistent_load
--> 529     result = unpickler.load()
    530 
    531     deserialized_storage_keys = pickle_module.load(f)

~/anaconda3/lib/python3.6/site-packages/torch/serialization.py in persistent_load(saved_id)
    493             if root_key not in deserialized_objects:
    494                 deserialized_objects[root_key] = restore_location(
--> 495                     data_type(size), location)
    496             storage = deserialized_objects[root_key]
    497             if view_metadata is not None:

~/anaconda3/lib/python3.6/site-packages/torch/serialization.py in restore_location(storage, location)
    376     elif isinstance(map_location, torch.device):
    377         def restore_location(storage, location):
--> 378             return default_restore_location(storage, str(map_location))
    379     else:
    380         def restore_location(storage, location):

~/anaconda3/lib/python3.6/site-packages/torch/serialization.py in default_restore_location(storage, location)
    102 def default_restore_location(storage, location):
    103     for _, _, fn in _package_registry:
--> 104         result = fn(storage, location)
    105         if result is not None:
    106             return result

~/anaconda3/lib/python3.6/site-packages/torch/serialization.py in _cuda_deserialize(obj, location)
     84                                'to an existing device.'.format(
     85                                    device, torch.cuda.device_count()))
---> 86         return obj.cuda(device)
     87 
     88 

~/anaconda3/lib/python3.6/site-packages/torch/_utils.py in _cuda(self, device, non_blocking, **kwargs)
     74         else:
     75             new_type = getattr(torch.cuda, self.__class__.__name__)
---> 76             return new_type(self.size()).copy_(self, non_blocking)
     77 
     78 

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch-nightly_1539602533843/work/aten/src/THC/generic/THCTensorCopy.cpp:20

I think this error is generated while calculating the loss. Is it a bug in the current fastai v1.0.24?
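In case it helps narrow it down: cuda runtime error (59), a device-side assert at the loss step, is often a target label that falls outside [0, n_classes). A rough check, assuming learn and its data are built as in the notebook, is to run a single batch on the CPU, which raises a readable Python error instead of a CUDA assert (move the model back with learn.model.cuda() afterwards):

xb, yb = next(iter(learn.data.train_dl))
print('max label:', yb.max().item(), 'num classes:', learn.data.c)

model_cpu = learn.model.cpu()                          # CPU run surfaces a clear error
loss = learn.loss_func(model_cpu(xb.cpu()), yb.cpu())
print('loss on CPU:', loss.item())

Setting the environment variable CUDA_LAUNCH_BLOCKING=1 before starting the notebook also makes the GPU traceback point at the real call site rather than a later tensor copy.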

I had the same issue yesterday. I strongly suspect it's because of the Deep Learning AMI 18.0. I will try a different deep learning AMI and let you know how my experiments went!

64-bit (x86)

Deep Learning AMI (Ubuntu) Version 18.0

I got the same problem on Ubuntu 18.04 and fastai 1.0.42.

It trains for one epoch with batch size 8, then crashes while classifying the validation set.

Error message:
RuntimeError: CUDA out of memory. Tried to allocate 522.12 MiB (GPU 0; 10.91 GiB total capacity; 7.27 GiB already allocated; 343.69 MiB free; 1.59 GiB cached)

That error message makes no sense (perhaps I am missing something): 7.27 GiB already allocated + 1.59 GiB cached + 522.12 MiB requested is about 9.38 GiB, which is less than 10.91 GiB. The 343.69 MiB free is the part I don't understand. Perhaps this error message doesn't count the part of the GPU the model parameters run on.

I restarted the notebook, changed the batch size to 4, and reran the notebook up to learn.save('stage-1'). It worked.
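For reference, a minimal sketch of that workaround, assuming the src and size variables from the camvid notebook: rebuild the DataBunch with the smaller batch size before creating the learner, and clear the CUDA cache between attempts.

import torch

bs = 4   # was 8; halving it avoided the OOM during validation here
data = (src.transform(get_transforms(), size=size, tfm_y=True)
        .databunch(bs=bs)
        .normalize(imagenet_stats))

torch.cuda.empty_cache()   # releases cached-but-unused blocks between runs
print(f'{torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated')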


@keyurparalkar Not sure if it is related to your problem, but I once ran into a nasty form of this runtime error. I was running a simple test model on MNIST and CIFAR10, and the CUDA assertion error kept reappearing. Normally I load the data batches from an SSD, but here I didn't bother and loaded them from an HDD RAID (LVM under Ubuntu 16.04). I tried many workarounds; none worked. Finally, I copied the data to the SSD and ran it from there, and the error was gone. So I interpret this as a timing problem; perhaps it happens more easily with tiny thumbnail pictures on a fast GPU that finishes too early. I'm not sure exactly why this happens, but in my case the error surfaced in the NLLLoss module, and when debugging in Jupyter I couldn't even repr the input variable without an error. So think about load timing, bandwidth, and asynchronous processes with many num_workers.
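If anyone wants to test that hypothesis on their own setup, a quick, hedged way to take the asynchronous workers out of the picture is to rebuild the DataBunch with num_workers=0, so batches are loaded synchronously in the main process (path here stands for whatever your data folder is):

data = ImageDataBunch.from_folder(path, bs=64, num_workers=0).normalize(imagenet_stats)

If the assert disappears with num_workers=0 and reappears with the default, the loader/timing path is the likely culprit.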

Did anyone ever solve this? I'm seeing the same thing.