Device-side assert triggered - THCTensorCopy

saidaspen · March 23, 2018, 3:36pm

I have spent most of today on an error I just cannot figure out.

I am trying to do binary classification on structured data. Everything works fine during training, but when I try to predict on the test set it falls apart with the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-67-0b44351ee6aa> in <module>()
----> 1 preds = learn.predict_dl(md.test_dl)

~/projects/fastai/fastai/learner.py in predict_dl(self, dl)
    266         return predict_with_targs(self.model, dl)
    267 
--> 268     def predict_dl(self, dl): return predict_with_targs(self.model, dl)[0]
    269 
    270     def predict_array(self, arr):

~/projects/fastai/fastai/model.py in predict_with_targs(m, dl)
    150 
    151 def predict_with_targs(m, dl):
--> 152     preda,targa = predict_with_targs_(m, dl)
    153     return to_np(torch.cat(preda)), to_np(torch.cat(targa))
    154 

~/projects/fastai/fastai/model.py in predict_with_targs_(m, dl)
    146     if hasattr(m, 'reset'): m.reset()
    147     res = []
--> 148     for *x,y in iter(dl): res.append([get_prediction(m(*VV(x))),y])
    149     return zip(*res)
    150 

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

<ipython-input-53-895619868d8b> in forward(self, x_cat, x_cont)
     34             if self.use_bn: x = b(x)
     35             x = d(x)
---> 36         x = self.outp(x)
     37         if self.y_range:
     38             x = F.sigmoid(x)

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/linear.py in forward(self, input)
     53 
     54     def forward(self, input):
---> 55         return F.linear(input, self.weight, self.bias)
     56 
     57     def __repr__(self):

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in linear(input, weight, bias)
    833     if input.dim() == 2 and bias is not None:
    834         # fused op is marginally faster
--> 835         return torch.addmm(bias, input, weight.t())
    836 
    837     output = input.matmul(weight.t())

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorCopy.cu:204

The code I am trying to run is this:

model = MixedInputModel(emb_szs, n_cont=0, emb_drop=0, out_sz=2, szs=[500], drops=[0.5]).cuda()
bm = BasicModel(model, 'binary_classifier')
trn_df, trn_y = df.iloc[:train_size], y[:train_size]
val_df, val_y = df.iloc[train_size:], y[train_size:]
md = ColumnarModelData.from_data_frames(DIR_ROOT, trn_df, val_df, trn_y.astype('int'), val_y.astype('int'), cat_vars, 128, test_df=df_test)
learn = StructuredLearner(md, bm)
learn.crit = F.cross_entropy

The basic idea is from: https://github.com/KeremTurgutlu/deeplearning/blob/master/avazu/FAST.AI%20Classification%20-%20Kaggle%20Avazu%20CTR.ipynb

But I get another error if I instead use the fast.ai out-of-the-box code and run (similar to the Rossmann notebook):

m = md.get_learner(emb_szs, len(df.columns) - len(cat_vars), 0.5, 1, [1000,500], [0.001,0.01], y_range=y_range)
m.crit = F.binary_cross_entropy
pred_test=m.predict(False)

In this case I get:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THCUNN/generic/Threshold.cu:34

So no matter what I do, my test-set predictions seem to crash.

This is what the data frames look like:

Anyone have any idea about what to do?

Ralph · March 23, 2018, 5:53pm

smthPyTorch Dev, Facebook AI Research

Feb '17

Because of the asynchronous nature of CUDA, the stack trace may not point to where the assert was actually triggered.

If you run the program with CUDA_LAUNCH_BLOCKING=1 python script.py
you will get a more exact stack trace.

Blocking does not move the work to the CPU. It makes each CUDA kernel launch synchronous, so the error is raised at the call that caused it and the trace is more helpful.

See https://discuss.pytorch.org/t/runtimeerror-cuda-runtime-error-59/750/5
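
For notebook users, here is a minimal sketch of setting the variable from Python. The helper name `enable_cuda_launch_blocking` is made up for illustration; the only real piece is the `CUDA_LAUNCH_BLOCKING` environment variable. The variable is read when CUDA is initialized, so the assumption here is that the safest place to set it is the very first cell, before `torch` is imported.

```python
import os
import sys


def enable_cuda_launch_blocking() -> bool:
    """Set CUDA_LAUNCH_BLOCKING=1 for this process.

    Returns True if torch had not been imported yet (the safe case), and
    False if torch was already loaded, in which case CUDA may already be
    initialized and restarting the kernel is the safer option.
    """
    torch_already_loaded = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_loaded


if not enable_cuda_launch_blocking():
    print("torch was imported first; restart the kernel and run this cell first")
```

Checking `sys.modules` is deliberately conservative: importing torch does not necessarily initialize CUDA, but setting the variable before the import removes any doubt.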

saidaspen · March 26, 2018, 3:59am

I have set this on the server directly:

export CUDA_LAUNCH_BLOCKING=1

as well as starting the Jupyter notebook up with a cell running:

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

The error now already appears when running:

m.lr_find()

So, it might not be the same…

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-26-d81c6bd29d71> in <module>()
----> 1 learn.lr_find()

~/projects/fastai/fastai/learner.py in lr_find(self, start_lr, end_lr, wds, linear)
    255         layer_opt = self.get_layer_opt(start_lr, wds)
    256         self.sched = LR_Finder(layer_opt, len(self.data.trn_dl), end_lr, linear=linear)
--> 257         self.fit_gen(self.model, self.data, layer_opt, 1)
    258         self.load('tmp')
    259 

~/projects/fastai/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, **kwargs)
    159         n_epoch = sum_geom(cycle_len if cycle_len else 1, cycle_mult, n_cycle)
    160         return fit(model, data, n_epoch, layer_opt.opt, self.crit,
--> 161             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, **kwargs)
    162 
    163     def get_layer_groups(self): return self.models.get_layer_groups()

~/projects/fastai/fastai/model.py in fit(model, data, epochs, opt, crit, metrics, callbacks, stepper, **kwargs)
     94             batch_num += 1
     95             for cb in callbacks: cb.on_batch_begin()
---> 96             loss = stepper.step(V(x),V(y), epoch)
     97             avg_loss = avg_loss * avg_mom + loss * (1-avg_mom)
     98             debias_loss = avg_loss / (1 - avg_mom**batch_num)

~/projects/fastai/fastai/model.py in step(self, xs, y, epoch)
     38     def step(self, xs, y, epoch):
     39         xtra = []
---> 40         output = self.m(*xs)
     41         if isinstance(output,tuple): output,*xtra = output
     42         self.opt.zero_grad()

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

<ipython-input-16-895619868d8b> in forward(self, x_cat, x_cont)
     24     def forward(self, x_cat, x_cont):
     25         if self.n_emb != 0:
---> 26             x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
     27             x = torch.cat(x, 1)
     28             x = self.emb_drop(x)

<ipython-input-16-895619868d8b> in <listcomp>(.0)
     24     def forward(self, x_cat, x_cont):
     25         if self.n_emb != 0:
---> 26             x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
     27             x = torch.cat(x, 1)
     28             x = self.emb_drop(x)

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    101             input, self.weight,
    102             padding_idx, self.max_norm, self.norm_type,
--> 103             self.scale_grad_by_freq, self.sparse
    104         )
    105 

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/nn/_functions/thnn/sparse.py in forward(cls, ctx, indices, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
     45 
     46         if not indices.is_contiguous():
---> 47             ctx._indices = indices.contiguous()
     48             indices = ctx._indices
     49         else:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorCopy.cu:204

So I am guessing it has to do with indexes. Possibly on the data frames.
This is how my training set looks along with the categorical values I am setting:

Even · April 4, 2018, 2:06am

I was getting the same error now just trying to run the basic ColumnarModelData.from_data_frame call but I managed to fix it.

The key for me was that my categorical assignments needed to be contiguous. If the codes are anything other than 0-N, or if the embedding has fewer than N+1 rows, then PyTorch fails to properly index the embeddings.

There’s a function there that looks like it’s supposed to do the mapping for you, but it doesn’t seem to work.
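
One way to make the codes contiguous, sketched in plain Python. `make_contiguous` is a hypothetical helper, not part of fastai; in pandas, `Series.astype('category').cat.codes` gives a similar 0-based encoding.

```python
def make_contiguous(values):
    """Map arbitrary category labels to integer codes 0..n-1.

    Returns the encoded list and the label -> code mapping, so the same
    mapping can be reused on other data.
    """
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


codes, mapping = make_contiguous([10, 30, 30, 70])
n_rows_needed = len(mapping)  # the embedding needs at least this many rows
```

Keeping the mapping around matters: encoding the validation or test data independently would assign different codes to the same labels.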

shwetap7 · April 5, 2018, 12:08pm

@saidaspen I am also getting same error.

shwetap7 · April 5, 2018, 12:10pm

How do you choose the embedding size?

Even · April 20, 2018, 3:25am

If you look in Jeremy’s notebook he suggests a method. Essentially it’s min(50, (ncategories+1)//2): about half the number of categories, capped at 50.
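
A small sketch of that rule of thumb, assuming the Rossmann notebook's rule is min(50, (c+1)//2) for a variable with c categories, i.e. the size is capped at 50 rather than floored at 50. `emb_size` and the example cardinalities below are made up for illustration.

```python
def emb_size(n_categories: int, cap: int = 50) -> int:
    """Half the cardinality (rounded up), capped at `cap`."""
    return min(cap, (n_categories + 1) // 2)


# (cardinality, embedding width) pairs for three hypothetical columns
cat_sizes = {"store": 1116, "day_of_week": 8, "state_holiday": 3}
emb_szs = [(c, emb_size(c)) for c in cat_sizes.values()]
```

The first number in each pair is the number of embedding rows, which is what has to cover every category code the model will ever see.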

nok · May 3, 2018, 4:56pm

Did anyone find a solution to this? I think I am running into a similar issue… Thanks!

Farah · May 5, 2018, 2:01am

I am having the same error. How do you check whether the categorical assignments in the dataframe are contiguous, and how do you make them contiguous? Thanks.

Even · May 5, 2018, 3:54am

They should all be values between 0 and ncategories-1 for an embedding.
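
A quick way to check this before training, sketched in plain Python. `find_out_of_range` is a hypothetical helper; it assumes each categorical column is available as a list of integer codes and that you know how many rows each embedding has.

```python
def find_out_of_range(columns, n_rows):
    """Return {name: (min_code, max_code)} for every column that has a
    code outside the valid embedding range [0, n_rows[name] - 1]."""
    bad = {}
    for name, codes in columns.items():
        lo, hi = min(codes), max(codes)
        if lo < 0 or hi >= n_rows[name]:
            bad[name] = (lo, hi)
    return bad


columns = {"site_id": [0, 2, 1], "device": [1, 4, 2]}
n_rows = {"site_id": 3, "device": 4}
bad = find_out_of_range(columns, n_rows)  # "device" is flagged: code 4 needs 5 rows
```

Running the same check on the training, validation, and test columns narrows down which split and which variable trips the assert.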

Hadus · May 8, 2018, 12:44pm

I had the same problem and here is the easiest thing to do to fix it: FIX: F.binary_cross_entropy keeps crashing the GPU

msmedes · May 19, 2018, 1:54pm

df.dtypes will show you the columns in order along with their datatypes. What Jeremy does in the Rossmann notebook is make two lists, one of categorical vars and the other of continuous vars. He then constructs a dataframe copy from those lists like so:
joined = joined[cat_vars+contin_vars+[dep, 'Date']].copy()

RogerS49 · June 19, 2018, 10:56am

@saidaspen Thanks for the blocking tip. As a result, a more informative message appeared in the notebook log in the terminal window.

/opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorIndex.cu:360: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [3,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorIndex.cu:360: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [3,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorIndex.cu:360: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [3,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorCopy.cu line=204 error=59 : device-side assert triggered

Not sure this is helpful. This was triggered by m.lr_find().
My version of PyTorch is 0.3.1 [cuda90]. You may be using version 0.4.0.
I am working on a different dataset, from a Coursera course hosted on Kaggle. I am using it to try to form a better understanding of fast.ai. My starting point is the Rossmann notebook.

HiccupinGminor · July 11, 2018, 11:41pm

@saidaspen

I got this error too during the predict() call, and after considerable debugging effort, figured out what the problem is.

Using code similar to the Rossmann notebook, I trained on a separate dataset.

The “device-side assert triggered” error was popping up for categorical vars where the test dataframe’s cardinality was greater than the cardinality of the training + validation dataframes.

Since the embeddings are generated from the training dataframe’s categorical vars, there will be indexing errors if the vars in your test dataframe have greater cardinalities.

Hope that helps someone save some time.
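
One common way around this, sketched under assumptions: build the label-to-code mapping from the training data only, reserve code 0 for anything unseen, and apply that same mapping to the test data. The helper names are hypothetical, and this is not necessarily what fastai does internally.

```python
UNKNOWN = 0  # reserved code for labels not seen during training


def fit_codes(train_values):
    """Build a label -> code mapping from training data; codes start at 1."""
    return {label: i + 1 for i, label in enumerate(sorted(set(train_values)))}


def apply_codes(values, mapping):
    """Encode values with a fitted mapping; unseen labels become UNKNOWN."""
    return [mapping.get(v, UNKNOWN) for v in values]


mapping = fit_codes(["a", "b", "c", "b"])
test_codes = apply_codes(["b", "z", "a"], mapping)
n_embedding_rows = len(mapping) + 1  # +1 for the UNKNOWN row
```

With this scheme every test code is guaranteed to be below `n_embedding_rows`, so the embedding lookup cannot go out of range, at the cost of treating all unseen categories as one.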

maxim.pechyonkin · October 4, 2018, 9:46am

I agree with you. The problem here is that the number of classes in the training set differs from the number in the test set. I encountered this problem in the Dog Breeds and Whale Identification Kaggle competitions.

sachinkundu · October 9, 2018, 10:02am

@maxim.pechyonkin how did you solve the problem? I am also trying to work with the whale identification dataset and do see
len(pd.Series(data.trn_ds.y).unique())
and len(pd.Series(data.val_ds.y).unique()) report different numbers.

Is that the problem you are referring to?
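
Comparing the counts alone does not show which labels differ. Here is a small sketch that compares the label sets directly; `label_diff` is a hypothetical helper, and the two lists stand in for `data.trn_ds.y` and `data.val_ds.y`.

```python
def label_diff(train_labels, val_labels):
    """Return (labels only in train, labels only in val), each sorted."""
    train_set, val_set = set(train_labels), set(val_labels)
    return sorted(train_set - val_set), sorted(val_set - train_set)


only_train, only_val = label_diff([0, 1, 2, 2, 3], [1, 2, 4])
```

Labels that show up only in the validation list are the ones worth checking first, since a model sized from the training labels may have no slot for them.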

MadeUpMasters · November 13, 2019, 5:11pm

If anyone else runs into this problem I found this article to be exceedingly helpful.