CUDA runtime error (59)

The more I fight with things, the more I figure stuff out.

Correct me if I’m wrong, but as far as using torchtext for a multi-label classification problem goes, the solution is to define one torchtext.data.Field object for every label … and then use F.sigmoid as my final non-linearity and F.binary_cross_entropy as my loss function.
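A minimal sketch of that idea, assuming a CSV with one text column and several 0/1 label columns (the column names and file path here are made up, not taken from the actual dataset code):

    from torchtext import data

    TEXT = data.Field(lower=True)                          # tokenized comment text
    LABEL = data.Field(sequential=False, use_vocab=False)  # raw 0/1 label values

    # one (column, Field) pair per label, in addition to the text column
    fields = [('comment_text', TEXT),
              ('toxic', LABEL), ('obscene', LABEL), ('insult', LABEL)]

    train = data.TabularDataset(path='train.csv', format='csv',
                                fields=fields, skip_header=True)
    TEXT.build_vocab(train)

The individual label columns then just need to be stacked into a single 0/1 float target tensor before computing the loss.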

This means I can’t use the fast.ai TextData or TextDataLoader objects as is (although they were instructive as to what I needed to do for the multi-label dataset).

Now my question is: how do I convert these values to probabilities?

@wgpubs
Thank you so much for sharing the code to create multi label data in TorchText
I am trying to solve the Toxic comment challenge on kaggle.

As per my understanding, binary_cross_entropy should be used as the criterion in RNN_Learner (please correct me if I am wrong). When I use binary_cross_entropy as the criterion, fitting the model throws an error:

   1177             weight = Variable(weight)
   1178 
-> 1179     return torch._C._nn.binary_cross_entropy(input, target, weight, size_average)
   1180 
   1181 

RuntimeError: reduce failed to synchronize: device-side assert triggered

I am unable to figure out this Runtime error.
Did you face this in your experiments?

Thanks again.

Yes, binary cross entropy is the right loss function for multi-label problems.

Unfortunately, I’m not familiar with that error. You may want to try restarting your notebook or even your machine. Let us know if you figure it out.

I actually came across the same issue. I know it is due to the negative numbers in the output tensor because the error goes away if I take the absolute value of the output like this in model.py:

    def step(self, xs, y):
        xtra = []
        output = self.m(*xs)
        if isinstance(output,(tuple,list)): output,*xtra = output
        self.opt.zero_grad()
        output = output.abs()  #<<------ This thing. BUT DON'T DO THIS
        print(output)

But of course we do NOT want to do that. So I’ll investigate further and will let you know.

Any luck with this?

Yep, make sure that you have a sigmoid layer at the end to get rid of negative predictions. binary_cross_entropy expects probabilities in [0, 1], so it doesn’t like negative numbers.
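To illustrate the point, here is a standalone sketch with made-up tensors, using only standard PyTorch functions:

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([[2.0, -1.5, 0.3]])  # raw model outputs, can be negative
    target = torch.tensor([[1.0, 0.0, 1.0]])   # multi-label 0/1 targets

    # binary_cross_entropy expects probabilities in [0, 1]; feeding it raw
    # (possibly negative) logits is what triggers the device-side assert
    probs = torch.sigmoid(logits)
    loss = F.binary_cross_entropy(probs, target)

    # equivalent, and numerically more stable: let the loss apply the sigmoid
    loss_logits = F.binary_cross_entropy_with_logits(logits, target)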

Ah, that makes sense! Thank you Hiromi!

Not sure if this is only a result of negative values. For example, I’m using ImageClassifier.from_csv with is_multi set to false, so the outputs do have negative values, but the last layer is LogSoftmax, so I’d expect only positives to be returned; the loss is nll_loss, yet I get the exact same error.

I think LogSoftmax does return negative values; I looked up the definition and it’s this:

softmax:

exp(x_i) / exp(x).sum()

log_softmax:

log( exp(x_i) / exp(x).sum() )

So say softmax returns 0.1; then LogSoftmax returns log(0.1) ≈ -2.3, which is negative.

Maybe you can try printing out the activations before you call the loss function to double check?
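A quick toy check of exactly that, with random activations rather than the actual model:

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 5)            # fake final-layer activations
    print(F.softmax(x, dim=1))       # values in (0, 1) that sum to 1
    print(F.log_softmax(x, dim=1))   # logs of those values: always <= 0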

Yep, you’re right, that’s indeed the case. So how did you go about solving it? If I just add a sigmoid on top of the last LogSoftmax activation, things seemingly work, but my losses are negative and the learning rate finder stops working.

Is there a reason you are using LogSoftmax over Softmax? Softmax is good for choosing a single thing at the end, whereas Sigmoid allows multiple activations to be large. So, if you are okay with picking just one class, I’d replace LogSoftmax with Softmax; that will keep everything between 0 and 1. If you want to choose multiple things, I’d use Sigmoid instead of LogSoftmax.

Hope that helps!
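To make the difference concrete, a toy example not tied to any particular model:

    import torch
    import torch.nn.functional as F

    x = torch.tensor([[3.0, 2.5, -4.0]])

    print(F.softmax(x, dim=1))   # ~[0.62, 0.38, 0.0006]: one distribution, sums to 1
    print(torch.sigmoid(x))      # ~[0.95, 0.92, 0.02]: several labels can be "on" at once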

Thanks for the feedback. I’m just using the default pre-trained resnet50, which puts a LogSoftmax at the end for single-class classification. This one is for the humpback whales, which have about 4000 classes. The default last layer is

('LogSoftmax-184',
              OrderedDict([('input_shape', [-1, 4250]),
                           ('output_shape', [-1, 4250]),
                           ('nb_params', 0)]))])

That works fine for lr_find but crashes with the above-mentioned error when trying to train.

If I change the last layer to

('Softmax-184',
              OrderedDict([('input_shape', [-1, 4250]),
                           ('output_shape', [-1, 4250]),
                           ('nb_params', 0)]))])

things seem to break: lr_find plots an empty chart and the reported loss values look off:
9%|▉ | 12/131 [00:01<00:15, 7.62it/s, loss=-0.000223]
Still can’t seem to find my way around this one.
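For what it’s worth, a tiny negative loss like that is what nll_loss produces when it is fed probabilities instead of log-probabilities, since it simply negates the value at the target index. A toy illustration with a roughly uniform output over 4250 classes (made-up tensors, not the whale model):

    import torch
    import torch.nn.functional as F

    probs = torch.full((1, 4250), 1.0 / 4250)    # roughly uniform Softmax output
    target = torch.tensor([0])

    print(F.nll_loss(probs, target))             # ~ -0.000235: tiny and negative
    print(F.nll_loss(torch.log(probs), target))  # ~ 8.35: the proper cross-entropy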

Thanks! I spent days debugging, as the error message doesn’t give much useful information, and the whole kernel crashes after a device-side assert triggers, so I have to re-run the whole thing…

I am slightly confused here: what was the output of the language model? The default criterion is cross_entropy, so I expected I could just switch cross_entropy to binary_cross_entropy without changing anything else in the model, and change the target to a one-hot target.

According to the PyTorch documentation, F.cross_entropy combines log_softmax and nll_loss. On the other hand, F.binary_cross_entropy expects an input that has already gone through a sigmoid layer? Does that mean cross_entropy is effectively two layers that you can apply right after a linear layer, while binary_cross_entropy should be applied after a sigmoid layer?
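For reference, those compositions can be checked directly in plain PyTorch (a standalone sketch with random tensors):

    import torch
    import torch.nn.functional as F

    x = torch.randn(4, 3)                # raw linear-layer outputs (logits)
    y = torch.tensor([0, 2, 1, 1])       # class indices for cross_entropy
    y_onehot = torch.zeros(4, 3).scatter_(1, y.unsqueeze(1), 1.0)

    # cross_entropy == log_softmax followed by nll_loss
    a = F.cross_entropy(x, y)
    b = F.nll_loss(F.log_softmax(x, dim=1), y)

    # binary_cross_entropy expects probabilities, so the sigmoid comes first;
    # binary_cross_entropy_with_logits fuses the two steps
    c = F.binary_cross_entropy(torch.sigmoid(x), y_onehot)
    d = F.binary_cross_entropy_with_logits(x, y_onehot)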

So I just noticed that, if I do it in PyTorch 0.4.0, it throws NaN instead of the device-side assert error.