Device-side assert triggered - THCTensorCopy


(Said Aspen) #1

I have spent most of today on an error I just cannot figure out.

I am trying to do binary classification on structured data. Everything works fine during training, but when I try to predict on the test set it falls apart with the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-67-0b44351ee6aa> in <module>()
----> 1 preds = learn.predict_dl(md.test_dl)

~/projects/fastai/fastai/learner.py in predict_dl(self, dl)
    266         return predict_with_targs(self.model, dl)
    267 
--> 268     def predict_dl(self, dl): return predict_with_targs(self.model, dl)[0]
    269 
    270     def predict_array(self, arr):

~/projects/fastai/fastai/model.py in predict_with_targs(m, dl)
    150 
    151 def predict_with_targs(m, dl):
--> 152     preda,targa = predict_with_targs_(m, dl)
    153     return to_np(torch.cat(preda)), to_np(torch.cat(targa))
    154 

~/projects/fastai/fastai/model.py in predict_with_targs_(m, dl)
    146     if hasattr(m, 'reset'): m.reset()
    147     res = []
--> 148     for *x,y in iter(dl): res.append([get_prediction(m(*VV(x))),y])
    149     return zip(*res)
    150 

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

<ipython-input-53-895619868d8b> in forward(self, x_cat, x_cont)
     34             if self.use_bn: x = b(x)
     35             x = d(x)
---> 36         x = self.outp(x)
     37         if self.y_range:
     38             x = F.sigmoid(x)

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/linear.py in forward(self, input)
     53 
     54     def forward(self, input):
---> 55         return F.linear(input, self.weight, self.bias)
     56 
     57     def __repr__(self):

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in linear(input, weight, bias)
    833     if input.dim() == 2 and bias is not None:
    834         # fused op is marginally faster
--> 835         return torch.addmm(bias, input, weight.t())
    836 
    837     output = input.matmul(weight.t())

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorCopy.cu:204

The code I am trying to run is this:

model = MixedInputModel(emb_szs, n_cont=0, emb_drop=0, out_sz=2, szs=[500], drops=[0.5]).cuda()
bm = BasicModel(model, 'binary_classifier')
trn_df, trn_y = df.iloc[:train_size], y[:train_size]
val_df, val_y = df.iloc[train_size:], y[train_size:]
md = ColumnarModelData.from_data_frames(DIR_ROOT, trn_df, val_df, trn_y.astype('int'), val_y.astype('int'), cat_vars, 128, test_df=df_test)
learn = StructuredLearner(md, bm)
learn.crit = F.cross_entropy

The basic idea is from : https://github.com/KeremTurgutlu/deeplearning/blob/master/avazu/FAST.AI%20Classification%20-%20Kaggle%20Avazu%20CTR.ipynb

But I get another error if I instead use the fast.ai out-of-the-box code (similar to the Rossmann notebook) and run:

m = md.get_learner(emb_szs, len(df.columns) - len(cat_vars), 0.5, 1, [1000,500], [0.001,0.01], y_range=y_range)
m.crit = F.binary_cross_entropy
pred_test=m.predict(False)

In this case I get:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THCUNN/generic/Threshold.cu:34

So no matter what I do, my test-set predictions seem to crash.

This is what the data frames look like:

Does anyone have any idea what to do?


#2

smthPyTorch Dev, Facebook AI Research

Feb '17

Because of the asynchronous nature of CUDA, the assert might not come with a fully correct stack trace pointing to where it was triggered.

If you run the program with CUDA_LAUNCH_BLOCKING=1 python script.py, you will get a more exact stack trace.

Blocking makes each CUDA kernel launch synchronous (the code still runs on the GPU), so the error surfaces at the call that caused it and the trace may be more helpful.

See https://discuss.pytorch.org/t/runtimeerror-cuda-runtime-error-59/750/5
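
The advice above can also be applied from Python. The variable needs to be in the environment before the process first initializes CUDA, so one option is to launch the failing script in a child process. This is a minimal sketch using only the standard library; `run_with_blocking_launches` is a hypothetical helper name, and the one-line `-c` command stands in for `script.py`.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(args):
    """Run a Python command line in a child process with CUDA_LAUNCH_BLOCKING=1."""
    # Copy the parent environment and add the variable for the child only.
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


# Stand-in for `script.py`: the child just reports the value it sees.
result = run_with_blocking_launches(
    ["-c", "import os; print(os.environ.get('CUDA_LAUNCH_BLOCKING'))"]
)
print(result.stdout.strip())
```

In a notebook, the equivalent is to set `os.environ['CUDA_LAUNCH_BLOCKING']` in the very first cell, before anything touches the GPU, and to restart the kernel if CUDA has already been used.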


(Said Aspen) #3

I have set this on the server directly:

export CUDA_LAUNCH_BLOCKING=1

as well as starting the Jupyter notebook with a cell running:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

The error now already appears when running:

m.lr_find()

So, it might not be the same…

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-26-d81c6bd29d71> in <module>()
----> 1 learn.lr_find()

~/projects/fastai/fastai/learner.py in lr_find(self, start_lr, end_lr, wds, linear)
    255         layer_opt = self.get_layer_opt(start_lr, wds)
    256         self.sched = LR_Finder(layer_opt, len(self.data.trn_dl), end_lr, linear=linear)
--> 257         self.fit_gen(self.model, self.data, layer_opt, 1)
    258         self.load('tmp')
    259 

~/projects/fastai/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, **kwargs)
    159         n_epoch = sum_geom(cycle_len if cycle_len else 1, cycle_mult, n_cycle)
    160         return fit(model, data, n_epoch, layer_opt.opt, self.crit,
--> 161             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, **kwargs)
    162 
    163     def get_layer_groups(self): return self.models.get_layer_groups()

~/projects/fastai/fastai/model.py in fit(model, data, epochs, opt, crit, metrics, callbacks, stepper, **kwargs)
     94             batch_num += 1
     95             for cb in callbacks: cb.on_batch_begin()
---> 96             loss = stepper.step(V(x),V(y), epoch)
     97             avg_loss = avg_loss * avg_mom + loss * (1-avg_mom)
     98             debias_loss = avg_loss / (1 - avg_mom**batch_num)

~/projects/fastai/fastai/model.py in step(self, xs, y, epoch)
     38     def step(self, xs, y, epoch):
     39         xtra = []
---> 40         output = self.m(*xs)
     41         if isinstance(output,tuple): output,*xtra = output
     42         self.opt.zero_grad()

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

<ipython-input-16-895619868d8b> in forward(self, x_cat, x_cont)
     24     def forward(self, x_cat, x_cont):
     25         if self.n_emb != 0:
---> 26             x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
     27             x = torch.cat(x, 1)
     28             x = self.emb_drop(x)

<ipython-input-16-895619868d8b> in <listcomp>(.0)
     24     def forward(self, x_cat, x_cont):
     25         if self.n_emb != 0:
---> 26             x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
     27             x = torch.cat(x, 1)
     28             x = self.emb_drop(x)

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    101             input, self.weight,
    102             padding_idx, self.max_norm, self.norm_type,
--> 103             self.scale_grad_by_freq, self.sparse
    104         )
    105 

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/_functions/thnn/sparse.py in forward(cls, ctx, indices, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
     45 
     46         if not indices.is_contiguous():
---> 47             ctx._indices = indices.contiguous()
     48             indices = ctx._indices
     49         else:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorCopy.cu:204

So I am guessing it has to do with indexes, possibly in the data frames.
This is how my training set looks, along with the categorical values I am setting:
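
One way to test the guess about indexes is to compare the integer codes in each categorical column against the number of rows in the matching embedding. The sketch below is plain Python and assumes `emb_szs` is a list of `(n_rows, dim)` tuples in the same order as the columns; `find_out_of_range` and the sample data are hypothetical and not part of fastai.

```python
def find_out_of_range(codes_by_column, emb_szs):
    """Return {column: (min_code, max_code, n_rows)} for every column whose
    codes fall outside the valid embedding range [0, n_rows)."""
    problems = {}
    for (name, codes), (n_rows, _dim) in zip(codes_by_column.items(), emb_szs):
        lo, hi = min(codes), max(codes)
        if lo < 0 or hi >= n_rows:
            problems[name] = (lo, hi, n_rows)
    return problems


# Hypothetical data: 'site' has 3 embedding rows, but a code of 3 appears.
codes_by_column = {"site": [0, 1, 3], "device": [0, 2, 1]}
emb_szs = [(3, 2), (3, 2)]
print(find_out_of_range(codes_by_column, emb_szs))  # {'site': (0, 3, 3)}
```

Running the same check on the test set is worthwhile too: pandas categorical codes use -1 for values that are not among the categories, and a negative or too-large index is the kind of input that can trip this assert on the GPU.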


(Even Oldridge) #4

I was getting the same error just now while trying to run the basic ColumnarModelData.from_data_frame call, but I managed to fix it.

The key for me was that my categorical assignments needed to be contiguous. If they are anything other than 0 to N, or if the embedding has anything other than N+1 rows, PyTorch fails to index the embeddings properly.

There’s a function there that looks like it’s supposed to do the mapping for you, but it doesn’t seem to work.
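
A contiguous mapping like the one described above can be sketched as follows. This is an assumption about one reasonable approach, not the fastai function mentioned in the post: codes 1..N are assigned from the training values only, and 0 is reserved for anything unseen, so the embedding needs N+1 rows. `build_code_map` and `encode` are hypothetical names.

```python
def build_code_map(train_values):
    """Map each distinct training value to a code in 1..N; 0 stays reserved."""
    distinct = sorted(set(train_values))
    return {value: code for code, value in enumerate(distinct, start=1)}


def encode(values, code_map):
    """Encode values with the training map; unseen values get code 0."""
    return [code_map.get(value, 0) for value in values]


train = ["b", "a", "c", "a"]
test = ["a", "d"]  # "d" never appears in training

code_map = build_code_map(train)
train_codes = encode(train, code_map)
test_codes = encode(test, code_map)
n_rows = len(code_map) + 1  # N distinct values plus the reserved 0 row

print(train_codes, test_codes, n_rows)  # [2, 1, 3, 1] [1, 0] 4
```

Building the map from the training data and reusing it for validation and test keeps every code inside the embedding, even when the test set contains categories the model has never seen.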


(shweta ) #5

@saidaspen I am also getting the same error.


(shweta ) #6

How do you choose the embedding size?
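
One common heuristic, which as far as I recall is the one used in the fast.ai Rossmann notebook, is to give each categorical variable an embedding width of about half its number of categories, capped at 50. The sketch below assumes each cardinality already includes one extra row for unknown values; `embedding_sizes` is a hypothetical helper, and the cap and the halving are rules of thumb rather than requirements.

```python
def embedding_sizes(cardinalities, max_dim=50):
    """Return (n_rows, dim) pairs with dim = min(max_dim, (n_rows + 1) // 2)."""
    return [(n_rows, min(max_dim, (n_rows + 1) // 2)) for n_rows in cardinalities]


# Hypothetical cardinalities, each already including a row for unknowns.
print(embedding_sizes([8, 25, 1116]))  # [(8, 4), (25, 13), (1116, 50)]
```

Whatever width is chosen, the first number in each pair must be at least one more than the largest code in that column, otherwise the lookup goes out of range.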