CUDA error when trying to predict with test set on Columnar Model

Hello,

I am working on the Allstate claims severity Kaggle competition. After training, I try to make predictions on the test set and I get a RuntimeError: CUDNN_STATUS_MAPPING_ERROR. Here is the code:

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128,
                                      test_df=df_test)
m = md.get_learner(emb_szs, n_cont+n_bincat, 0.04, 1, [1000, 500], [0.01, 0.1], y_range=y_range)
pred_test = m.predict(True)

I have left out the code that is not relevant to this issue. Please let me know if it is needed. The error and stack trace follow:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-ee28ccb05714> in <module>()
----> 1 pred_test = m.predict(True)

~/kaggle-comps/allstate/fastai/learner.py in predict(self, is_test)
    271     def predict(self, is_test=False):
    272         dl = self.data.test_dl if is_test else self.data.val_dl
--> 273         return predict(self.model, dl)
    274 
    275     def predict_with_targs(self, is_test=False):

~/kaggle-comps/allstate/fastai/model.py in predict(m, dl)
    134 
    135 def predict(m, dl):
--> 136     preda,_ = predict_with_targs_(m, dl)
    137     return to_np(torch.cat(preda))
    138 

~/kaggle-comps/allstate/fastai/model.py in predict_with_targs_(m, dl)
    146     if hasattr(m, 'reset'): m.reset()
    147     res = []
--> 148     for *x,y in iter(dl): res.append([get_prediction(m(*VV(x))),y])
    149     return zip(*res)
    150 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~/kaggle-comps/allstate/fastai/column_data.py in forward(self, x_cat, x_cont)
    112             x = self.emb_drop(x)
    113         if self.n_cont != 0:
--> 114             x2 = self.bn(x_cont)
    115             x = torch.cat([x, x2], 1) if self.n_emb != 0 else x2
    116         for l,d,b in zip(self.lins, self.drops, self.bns):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py in forward(self, input)
     35         return F.batch_norm(
     36             input, self.running_mean, self.running_var, self.weight, self.bias,
---> 37             self.training, self.momentum, self.eps)
     38 
     39     def __repr__(self):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in batch_norm(input, running_mean, running_var, weight, bias, training, momentum, eps)
   1011             raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
   1012     f = torch._C._functions.BatchNorm(running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled)
-> 1013     return f(input, weight, bias)
   1014 
   1015 

RuntimeError: CUDNN_STATUS_MAPPING_ERROR

Any help is appreciated.

I deleted all my intermediate files and started from scratch, and this time I got a different error. It also occurs during testing; training ran without any trouble:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-13-ee28ccb05714> in <module>()
----> 1 pred_test = m.predict(True)

~/kaggle-comps/allstate/fastai/learner.py in predict(self, is_test)
    271     def predict(self, is_test=False):
    272         dl = self.data.test_dl if is_test else self.data.val_dl
--> 273         return predict(self.model, dl)
    274 
    275     def predict_with_targs(self, is_test=False):

~/kaggle-comps/allstate/fastai/model.py in predict(m, dl)
    134 
    135 def predict(m, dl):
--> 136     preda,_ = predict_with_targs_(m, dl)
    137     return to_np(torch.cat(preda))
    138 

~/kaggle-comps/allstate/fastai/model.py in predict_with_targs_(m, dl)
    146     if hasattr(m, 'reset'): m.reset()
    147     res = []
--> 148     for *x,y in iter(dl): res.append([get_prediction(m(*VV(x))),y])
    149     return zip(*res)
    150 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~/kaggle-comps/allstate/fastai/column_data.py in forward(self, x_cat, x_cont)
    109         if self.n_emb != 0:
    110             x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
--> 111             x = torch.cat(x, 1)
    112             x = self.emb_drop(x)
    113         if self.n_cont != 0:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCGeneral.c:844

I also noticed that the failure is not consistent: sometimes I get RuntimeError: CUDNN_STATUS_MAPPING_ERROR and sometimes I get this device-side assert. I would really appreciate some help with this. I'm curious whether this is an issue with PyTorch or with CUDA itself.

I'm running this on the Paperspace fastai template with the latest master pull of fastai.

Thanks.
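
A note on why the two tracebacks may point at different lines: CUDA kernels are launched asynchronously, so a device-side error is often reported at a later, unrelated call (here batch norm in one run and torch.cat in the other). A minimal sketch for getting a more trustworthy traceback, assuming the environment variable can be set before anything touches the GPU, is to force synchronous launches:

```python
import os

# Assumption: this runs at the very top of the notebook or script,
# before torch (or fastai) makes its first CUDA call.
# With synchronous launches, the Python traceback should point at the
# operation that actually failed rather than at a later one.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

After restarting the kernel with this set, rerunning m.predict(True) should show which operation really triggers the assert. This slows execution down, so it is only meant for debugging.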

Hey @shaun1,

I'm having the same problem. Did you solve it?

Best regards
Michael

I am having the same problem too,
although my first attempt at predicting on the test set ran without any problem.

I am facing the same issue. Has anyone solved it?

I think this is a 'Resource Exhausted' error. Changing the batch size helps.

I got this error when training was ~20% done. I added "learn.model.cuda()" before the training code, and I stopped getting this error.
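
One more thing that may be worth checking, since the second traceback fails inside the embedding lookup in column_data.py: a device-side assert there is commonly caused by a categorical code that is negative or not smaller than the number of rows in the corresponding embedding, which can happen if the test set contains categories that were not seen when the embedding sizes were computed. This thread does not confirm that as the cause. The sketch below is only a diagnostic, and it assumes that each entry of emb_szs is a (cardinality, dim) pair and that the categorical codes of the test set are available as rows of integers. The helper name find_out_of_range_codes is hypothetical and not part of fastai.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def find_out_of_range_codes(
    rows: Iterable[Sequence[int]],
    cardinalities: Sequence[int],
) -> List[Tuple[int, int, int]]:
    """Return (column, min_code, max_code) for each categorical column that
    holds a code outside the valid range [0, cardinalities[column])."""
    mins: List[Optional[int]] = [None] * len(cardinalities)
    maxs: List[Optional[int]] = [None] * len(cardinalities)

    # Track the smallest and largest code seen in every column.
    for row in rows:
        for col, code in enumerate(row):
            code = int(code)
            mins[col] = code if mins[col] is None else min(mins[col], code)
            maxs[col] = code if maxs[col] is None else max(maxs[col], code)

    # A column is a problem if any code falls outside its embedding table.
    bad = []
    for col, n_rows in enumerate(cardinalities):
        if mins[col] is None:
            continue  # no data seen for this column
        if mins[col] < 0 or maxs[col] >= n_rows:
            bad.append((col, mins[col], maxs[col]))
    return bad


# Toy data: column 1 contains code 5, but its embedding has only 5 rows (0..4).
test_codes = [[0, 2], [1, 5], [3, 1]]
cardinalities = [4, 5]  # assumed: first element of each pair in emb_szs
problems = find_out_of_range_codes(test_codes, cardinalities)
print(problems)
```

If this reports any column for the real test data, the embedding for that column is being indexed out of bounds, and the category encoding of the test set would be the place to look.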