CUDA error when trying to predict with test set on Columnar Model

shaun1 · March 30, 2018, 3:16pm

Hello,

I am working on the Allstate claims severity Kaggle competition. After training, I try to make predictions on the test set and I get a RuntimeError: CUDNN_STATUS_MAPPING_ERROR. Here is the code:

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128,
                                      test_df=df_test)
m = md.get_learner(emb_szs, n_cont+n_bincat, 0.04, 1, [1000, 500], [0.01, 0.1], y_range=y_range)
pred_test = m.predict(True)

I have left out the code that is not relevant to this issue. Please let me know if it is needed. The error and the stack trace follow:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-ee28ccb05714> in <module>()
----> 1 pred_test = m.predict(True)

~/kaggle-comps/allstate/fastai/learner.py in predict(self, is_test)
    271     def predict(self, is_test=False):
    272         dl = self.data.test_dl if is_test else self.data.val_dl
--> 273         return predict(self.model, dl)
    274 
    275     def predict_with_targs(self, is_test=False):

~/kaggle-comps/allstate/fastai/model.py in predict(m, dl)
    134 
    135 def predict(m, dl):
--> 136     preda,_ = predict_with_targs_(m, dl)
    137     return to_np(torch.cat(preda))
    138 

~/kaggle-comps/allstate/fastai/model.py in predict_with_targs_(m, dl)
    146     if hasattr(m, 'reset'): m.reset()
    147     res = []
--> 148     for *x,y in iter(dl): res.append([get_prediction(m(*VV(x))),y])
    149     return zip(*res)
    150 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~/kaggle-comps/allstate/fastai/column_data.py in forward(self, x_cat, x_cont)
    112             x = self.emb_drop(x)
    113         if self.n_cont != 0:
--> 114             x2 = self.bn(x_cont)
    115             x = torch.cat([x, x2], 1) if self.n_emb != 0 else x2
    116         for l,d,b in zip(self.lins, self.drops, self.bns):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py in forward(self, input)
     35         return F.batch_norm(
     36             input, self.running_mean, self.running_var, self.weight, self.bias,
---> 37             self.training, self.momentum, self.eps)
     38 
     39     def __repr__(self):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in batch_norm(input, running_mean, running_var, weight, bias, training, momentum, eps)
   1011             raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
   1012     f = torch._C._functions.BatchNorm(running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled)
-> 1013     return f(input, weight, bias)
   1014 
   1015 

RuntimeError: CUDNN_STATUS_MAPPING_ERROR

Any help is appreciated.

shaun1 · April 1, 2018, 4:49pm

I deleted all my intermediate files and started from scratch, and this time I got a different error. It also occurs during testing; training ran without any trouble:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-13-ee28ccb05714> in <module>()
----> 1 pred_test = m.predict(True)

~/kaggle-comps/allstate/fastai/learner.py in predict(self, is_test)
    271     def predict(self, is_test=False):
    272         dl = self.data.test_dl if is_test else self.data.val_dl
--> 273         return predict(self.model, dl)
    274 
    275     def predict_with_targs(self, is_test=False):

~/kaggle-comps/allstate/fastai/model.py in predict(m, dl)
    134 
    135 def predict(m, dl):
--> 136     preda,_ = predict_with_targs_(m, dl)
    137     return to_np(torch.cat(preda))
    138 

~/kaggle-comps/allstate/fastai/model.py in predict_with_targs_(m, dl)
    146     if hasattr(m, 'reset'): m.reset()
    147     res = []
--> 148     for *x,y in iter(dl): res.append([get_prediction(m(*VV(x))),y])
    149     return zip(*res)
    150 

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~/kaggle-comps/allstate/fastai/column_data.py in forward(self, x_cat, x_cont)
    109         if self.n_emb != 0:
    110             x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
--> 111             x = torch.cat(x, 1)
    112             x = self.emb_drop(x)
    113         if self.n_cont != 0:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCGeneral.c:844
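
The frames above point at the embedding lookups and the torch.cat that follows them. A device-side assert at this point is often caused by a category index that falls outside an embedding table, for example a category that appears in the test set but not in the training data. That is a guess about this case, not a confirmed diagnosis. A minimal sketch of a check, using plain NumPy and toy data; the function name and the arrays are hypothetical, and it assumes the cardinalities are the first element of each pair in emb_szs:

```python
import numpy as np

def find_out_of_range_codes(x_cat, cardinalities):
    """Return {column index: (min code, max code, cardinality)} for every
    categorical column whose codes fall outside [0, cardinality)."""
    x_cat = np.asarray(x_cat)
    bad = {}
    for i, n in enumerate(cardinalities):
        col = x_cat[:, i]
        lo, hi = int(col.min()), int(col.max())
        if lo < 0 or hi >= n:
            bad[i] = (lo, hi, n)
    return bad

# Toy stand-ins (hypothetical): rows are test samples, columns are
# categorical variables already converted to integer codes.
test_codes = np.array([[0, 2],
                       [1, 5],
                       [3, 1]])
emb_cardinalities = [4, 5]  # e.g. [c for c, _ in emb_szs] (assumed layout)

print(find_out_of_range_codes(test_codes, emb_cardinalities))
# prints {1: (1, 5, 5)}
```

An empty result would suggest the indices are fine and the cause lies elsewhere.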

I also noticed that the error is not consistent: sometimes I get RuntimeError: CUDNN_STATUS_MAPPING_ERROR and sometimes the device-side assert shown above. I would really appreciate some help with this. I’m curious whether this is an issue with PyTorch or with CUDA itself.
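
One possible reason the error surfaces in different places is that CUDA kernels are launched asynchronously, so the Python frame in the traceback is not necessarily the operation that failed. A common debugging step is to set the CUDA_LAUNCH_BLOCKING environment variable so that launches become synchronous and the traceback lines up with the failing call. A minimal sketch, assuming it runs at the very top of the notebook before anything touches the GPU:

```python
import os

# Must be set before the first CUDA call; the simplest way is to set it
# before importing torch or fastai at all.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# After this point, import the libraries and rerun the failing call
# (for example pred_test = m.predict(True)), then read the new traceback.
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

This slows execution down, so it is only meant for debugging runs.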

I’m running this on the Paperspace fastai template with the latest master pull of fastai.

Thanks.

MicPie · June 12, 2018, 5:25pm

Hey @shaun1,

I’m having the same problem. Did you solve it?

Best regards
Michael

kachun1017 · September 15, 2018, 7:16am

I am having the same problem too.
However, my first attempt at predicting on the test set ran without any problem.

abhiksark · March 13, 2019, 4:29am

I am facing the same issue. Has anyone solved it?

abhiksark · March 13, 2019, 4:33am

I think this is a ‘Resource Exhausted’ error. Changing the batch size helps.

reidfu · June 2, 2019, 3:05am

I got this error when training was ~20% done. After I added “learn.model.cuda()” before the training code, I stopped getting this error.