WideResNet (wrn_22) CUDA Error in Lesson1-pets

I am trying to implement WideResNet for a project and keep getting the following CUDA error.

RuntimeError: CUDA error: device-side assert triggered

Which, as sgugger explains, is a generic bad-index error.

Any idea how to troubleshoot and solve this? I have tried:

  1. Resetting everything (Did you turn it off and on?)
  2. Googling the error (most results are about masking)
  3. Re-updating everything
  4. Running it in lesson1-pets
  5. Reducing the batch size to 10

I took a screenshot in lesson1-pets. While ResNet still works, wrn_22 gives the device-side assert triggered error.


---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
in
----> 1 learn.fit_one_cycle(1)

~/anaconda3/lib/python3.7/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
     21                                        final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 22     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     23 
     24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, wd:float=None):

~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    197         callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
    198         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
--> 199         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    200 
    201     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
     99             for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
    100                 xb, yb = cb_handler.on_batch_begin(xb, yb)
--> 101                 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
    102                 if cb_handler.on_batch_end(loss): break
    103 

~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     31 
     32     if opt is not None:
---> 33         loss,skip_bwd = cb_handler.on_backward_begin(loss)
     34         if not skip_bwd:                     loss.backward()
     35         if not cb_handler.on_backward_end(): opt.step()

~/anaconda3/lib/python3.7/site-packages/fastai/callback.py in on_backward_begin(self, loss)
    288     def on_backward_begin(self, loss:Tensor)->None:
    289         "Handle gradient calculation on `loss`."
--> 290         self.smoothener.add_value(loss.detach().cpu())
    291         self.state_dict['last_loss'], self.state_dict['smooth_loss'] = loss, self.smoothener.smooth
    292         self('backward_begin', call_mets=False)

RuntimeError: CUDA error: device-side assert triggered

The wide resnet 22 is intended for CIFAR-10, so it has an output hardcoded with 10 classes. You should use the functions in that module to create a model suitable for your data :wink:
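
To make the "bad index" concrete, here is a minimal sketch (not your actual model, just the shapes involved): a 10-way head scored against pet labels that run up to 36 trips the assert on the GPU, while the same call on the CPU gives a readable error.

import torch
import torch.nn.functional as F

# 10 outputs (the wrn_22 default) versus targets drawn from 37 classes.
logits = torch.randn(4, 10)              # batch of 4, only 10 classes
targets = torch.tensor([0, 5, 21, 36])   # pet labels run from 0 to 36

try:
    F.cross_entropy(logits, targets)     # on the CPU: a clear out-of-bounds error
except (IndexError, RuntimeError) as e:
    print(e)
# On the GPU the same bad index only surfaces as
# "RuntimeError: CUDA error: device-side assert triggered".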


Gotcha, this seems to work.

from fastai.vision.models.wrn import WideResNet

def wrn_Custom(num_groups=3, N=3, num_classes=10, k=6, drop_p=0.):
    "Default Wide ResNet has 22 layers; num_classes is configurable here."
    return WideResNet(num_groups, N, num_classes, k, drop_p)
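
For anyone following along, the wiring is roughly this (a sketch assuming data is the lesson1-pets ImageDataBunch from the notebook, so data.c is its number of classes):

model = wrn_Custom(num_classes=data.c)            # match the head to the data, not CIFAR-10
learn = Learner(data, model, metrics=error_rate)  # loss function is inferred from the data
learn.fit_one_cycle(1)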

Accuracy is a bit lower; I will post an update when I get back to it.

*Many epochs later: it just occurred to me that the reason I am not getting very good results moving from ResNet to WideResNet is that there is no WideResNet pre-trained on ImageNet to transfer from.

Any recommendations on where I can go to get these weights?

WideResNet, in the middle of 20 hours of training

ResNet-50 at the end of Lesson1-pets

I see a wide resnet 50 in Cadene's pretrained models, but that's all I could find.


Awesome, that is very helpful for more than just WideResNet! Apologies that this thread has gone a little off-topic, and thank you for helping me troubleshoot it all!

After noticing that Cadene's ResNeXt is more popular, I found a recent NIPS paper suggesting ResNeXt would be better anyway, and I will explore that route.

Also, I really like the following notebook explaining pre-trained Cadene models.
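
In case it helps anyone else, the basic loading pattern from that package looks roughly like this (a sketch assuming the pretrainedmodels pip package is installed; the model name below is just one entry from its zoo):

import pretrainedmodels

print(pretrainedmodels.model_names)   # list the available architectures
model = pretrainedmodels.__dict__['resnext101_32x4d'](num_classes=1000, pretrained='imagenet')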

Edit: Yeah it got a little better


The CUDA device-side assert triggered error occurs for a number of reasons; is there a simple way to debug it?
In my case it takes quite a long time to get to the actual error. Any rule of thumb?
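
So far the only general trick I have found is to force synchronous kernel launches, or to replay one batch on the CPU where a bad index raises a readable error (rough sketch below, assuming a fastai learn object), but I would still love a better rule of thumb.

import os

# Make kernel launches synchronous so the traceback points at the op that
# actually failed (set this before CUDA is initialised).
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# Or replay a single batch on the CPU, where an out-of-range index raises a
# readable error instead of a device-side assert.
learn.model.cpu()
xb, yb = next(iter(learn.data.train_dl))
loss = learn.loss_func(learn.model(xb.cpu()), yb.cpu())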