FIX: F.binary_cross_entropy keeps crashing the GPU

After hours of search and restarting the whole system to reset the gpu like 20 times :slight_smile:

I finally found that its very easy to fix just by replacing it with nn.BCEWithLogitsLoss()

1 Like

are you just replacing m.crit = F.binary_cross_entropy with m.crit = nn.BCEWithLogitsLoss() before predicting?

Yes, that is because with F.binary_cross_entropy, the output of the model (i.e. final layer) does not have a Sigmoid() layer, hence why the GPU crash, most likely due to the negative predictions in the ouput (as @hiromi pointed this out in the Cuda runtime error (59) post).

WIth nn.BCEWithLogitsLoss() the Sigmoid() layer is added before computing the binary cross entropy loss (see doc nn.BCEWithLogitsLoss())

2 Likes

When I had F.binary_cross_entropy I did use Sigmoid(). But that could have been a problem for others I guess.

Same here. According to the docs, the nn.BCEWithLogitsLoss() “is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability”

1 Like

Cool insight!

Could you tell me please where you are replacing this line? I am trying to apply the notebook from Lesson 1 to a new data set and it keeps crashing. I am just following the steps and get cuda runtime error (59). My photos are all ok, I have no idea why this is happening.

Lesson 1 doesn’t use F.binary_cross_entropy. It uses F.nll_loss. You can check what the criterion is by printing learn.crit. You have a different problem.

1 Like

I switched to nn.BCEWithLogitsLoss() and checked with m.crit that it switched over. Still getting cuda runtime rror (59) when I try to run m.predict(True). the model dataloader is as follows (it’s essentially the rossman one):
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y.astype(float32), cat_flds=cur_cats, bs=128, test_df=df_test)

and the learner:

m = md.get_learner(emb_szs, len(df.columns)-len(cur_cats), 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)

I have y_range set to a tuple of (0,1), not sure if that’s correct or if that is the problem or what the heck is going on.

Maybe your batch size is too large?
Also what is the exact error message?

I had errors regarding float datatype. I replaced np.float32 with np.float in many places and got rid of all errors. Maybe this trick helps you as well :slight_smile:

1 Like

Batch size is fine, I have plenty of memory left so I don’t think it’s that. The exact error is:


RuntimeError Traceback (most recent call last)
in ()
----> 1 pred_test=m.predict(True)

~/fastai/courses/dl1/fastai/learner.py in predict(self, is_test, use_swa)
355 dl = self.data.test_dl if is_test else self.data.val_dl
356 m = self.swa_model if use_swa else self.model
–> 357 return predict(m, dl)
358
359 def predict_with_targs(self, is_test=False, use_swa=False):

~/fastai/courses/dl1/fastai/model.py in predict(m, dl)
227
228 def predict(m, dl):
–> 229 preda,_ = predict_with_targs_(m, dl)
230 return np.concatenate(preda)
231

~/fastai/courses/dl1/fastai/model.py in predict_with_targs_(m, dl)
239 if hasattr(m, ‘reset’): m.reset()
240 res = []
–> 241 for *x,y in iter(dl): res.append([get_prediction(to_np(m(*VV(x)))),to_np(y)])
242 return zip(*res)
243

~/miniconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
355 result = self._slow_forward(*input, **kwargs)
356 else:
–> 357 result = self.forward(*input, **kwargs)
358 for hook in self._forward_hooks.values():
359 hook_result = hook(self, input, result)

~/fastai/courses/dl1/fastai/column_data.py in forward(self, x_cat, x_cont)
112 def forward(self, x_cat, x_cont):
113 if self.n_emb != 0:
–> 114 x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
115 x = torch.cat(x, 1)
116 x = self.emb_drop(x)

~/fastai/courses/dl1/fastai/column_data.py in (.0)
112 def forward(self, x_cat, x_cont):
113 if self.n_emb != 0:
–> 114 x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
115 x = torch.cat(x, 1)
116 x = self.emb_drop(x)

~/miniconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
355 result = self._slow_forward(*input, **kwargs)
356 else:
–> 357 result = self.forward(*input, **kwargs)
358 for hook in self._forward_hooks.values():
359 hook_result = hook(self, input, result)

~/miniconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/sparse.py in forward(self, input)
101 input, self.weight,
102 padding_idx, self.max_norm, self.norm_type,
–> 103 self.scale_grad_by_freq, self.sparse
104 )
105

~/miniconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/_functions/thnn/sparse.py in forward(cls, ctx, indices, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
45
46 if not indices.is_contiguous():
—> 47 ctx._indices = indices.contiguous()
48 indices = ctx._indices
49 else:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorCopy.cu:204

So at this point I’m pretty sure it has something to do with indices (of something) and then somehow not being contiguous.

what is your pytorch version? 0.3.1 works best, 0.4 should also, maybe :slight_smile: but more recent ones not so sure…

I have a similar problem (see here: Structured Learner) but so far I could not find a good hint on what is wrong.

I checked my categorical variables and found a mistake (index was not set) that I corrected.
Also reducing the bs did not help (I also thought maybe I’m out of gpu ram and pytorch was not able to load it properly).

I use is_multi=False because I want to do 1/0 classification.

With this setup my final layer looks like this:
A.) output of learn.model:

  (outp): Linear(in_features=500, out_features=1, bias=True)
  (emb_drop): Dropout(p=0.4)
  (drops): ModuleList(
    (0): Dropout(p=0.001)
    (1): Dropout(p=0.01)

B.) output of learn.crit:
<function torch.nn.functional.nll_loss(input, target, weight=None, size_average=True, ignore_index=-100, reduce=True)>

Has somebody an idea where I should look for the error?

One thing I found out about " device-side assert triggered " is to look at the console where you started jupyter notebook, there will be a much more detailed message, e.g.:

27

In this example, the messages are Indicating that some tensors are of incompatible size (correction: actually, the error message says that the values inside the tensor should be between 0.0 and 1.0). Since it also includes the shapes of the tensors, it makes it easier to track them down in your code.

1 Like

Hello @gai,
thank you for the hint!
Now I have the full error message:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/THCTensorCopy.cu line=204 error=59 : device-side assert triggered
/opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/THCTensorIndex.cu:360: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 1, SrcDim = 1, IdxDim = -2, IndexIsMajor = true]: block: [0,0,0], thread: [46,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

This seems very similar to the error in this post Structured Learner which was not solved yet.
From the error description it looks like a dimension mismatch. I guess I have to check my df setup again.
I will keep digging deeper! :smiley: