Cuda runtime error (59)

Fighting with this for the last hour …

The error message:
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/generated/../generic/THCTensorMathReduce.cu:18

Working on the toxic comments competition and trying to define a very simple LSTM to predict just one of the categories (“toxic”). I had to define a custom dataset class and also a modified version of TextData, because the fast.ai one calls build_vocab for both the label and text fields, which I don’t want. I can post all that code (if anyone is interested), but I think the problem has something to do with my simple LSTM below.

I’ve defined my text and label fields as follows:

tt_TEXT = data.Field(sequential=True, tokenize=tokenizer, fix_length=100)
tt_LABEL = data.Field(sequential=False, use_vocab=False)

splits = ToxicDataset.splits(tt_TEXT, tt_LABEL, train_df, 'comment_text', 'toxic', val_df, None)
tt_TEXT.build_vocab(splits[0], max_size=20000)
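
For anyone interested, the custom dataset is roughly along these lines. This is a stripped-down sketch built on torchtext’s data.Example.fromlist / data.Dataset, just to show the shape of the splits call above, not my exact class:

class ToxicDataset(data.Dataset):
    # sketch: wrap a DataFrame so torchtext can iterate (text, label) examples
    def __init__(self, df, text_field, label_field, text_col, label_col, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        examples = [data.Example.fromlist([row[text_col], row[label_col]], fields)
                    for _, row in df.iterrows()]
        super().__init__(examples, fields, **kwargs)

    @classmethod
    def splits(cls, text_field, label_field, train_df, text_col, label_col, val_df=None, test_df=None):
        return tuple(cls(df, text_field, label_field, text_col, label_col)
                     for df in (train_df, val_df, test_df) if df is not None)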

My batches look like this:

b = next(iter(md.trn_dl))
b[0].size(), b[1].size()

Returns: (torch.Size([100, 64]), torch.Size([64]))

Here is my model:

class LstmClassifier(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl=1, out_sz=1):
        super().__init__()

        self.vocab_size, self.nl, self.out_sz = vocab_size, nl, out_sz

        self.e = nn.Embedding(vocab_size, n_fac)
        # n_hidden is a notebook-level global, as in the fast.ai lesson notebooks
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5)
        self.l_out = nn.Linear(n_hidden, out_sz)

        self.h = self.init_hidden(bs)

    def forward(self, words):
        # words: [seq_len, bs]; classify from the last timestep's output
        outp, h = self.rnn(self.e(words), self.h)
        self.h = repackage_var(h)   # detach the hidden state between batches
        preds = self.l_out(outp[-1])
        return F.log_softmax(preds)   # newer PyTorch wants an explicit dim=-1 here

    def init_hidden(self, bs):
        return (V(torch.zeros(self.nl, bs, n_hidden)), V(torch.zeros(self.nl, bs, n_hidden)))

m = LstmClassifier(md.nt, n_fac, bsz, 2, 1).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

Here is where I hit the error:
fit(m, md, 6, opt, F.nll_loss)

For anyone bold enough to have read this far, any ideas on what I may have pooched?

Here is where it is blowing up …

I am also a PyTorch noob (although learning and improving every day), but I can try to help because I had a similar error yesterday while building a customized model.

• I think sometimes it goes away when we restart the kernel (I don’t know why).

• Another thing that might be happening is a discrepancy between the indices and sizes of different layers. I think doing b = next(iter(md.trn_dl)); m(V(b)) and then looking at the layers might help. But try restarting the kernel first, and remember to do m.cuda() before fitting the model on the GPU.

Also be sure the input to the embedding layer consists of contiguous integers (0 to n-1). For categorical features you could do something like:

for c in cats: data[c].replace({val: i for i, val in enumerate(data[c].unique())}, inplace=True)
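
The reason this matters: nn.Embedding(num_embeddings, dim) just indexes into a [num_embeddings, dim] weight matrix, so any index outside 0..num_embeddings-1 trips exactly this kind of device-side assert on the GPU (on the CPU you get a readable out-of-range error instead). A tiny illustration with made-up sizes:

import torch
import torch.nn as nn

emb = nn.Embedding(5, 3)             # valid indices are 0..4
ok = emb(torch.LongTensor([0, 4]))   # fine (wrap in V(...)/Variable on PyTorch 0.3)
# bad = emb(torch.LongTensor([5]))   # out of range -> assert on GPU, readable error on CPU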

  1. Restarting didn’t fix anything.

  2. Running m(V(b)) did return this interesting error (I’ve seen it before but I’m not sure what’s causing it):

Actually I forgot that the batch is a tuple.

When I ran m(V(b[0])) it worked just fine.

I reviewed the indexes from my tt_TEXT.vocab and they are actually all contiguous so I don’t think the embedding layer is the problem :frowning:

If the model runs for one minibatch (i.e. m(V(b[0])) works), it should work when called in fit too. Maybe write your own training loop to narrow it down.
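
Something minimal along these lines, a rough sketch assuming md.trn_dl yields (text, label) batches and m, opt are defined as above:

for x, y in md.trn_dl:
    opt.zero_grad()
    preds = m(V(x))
    loss = F.nll_loss(preds, V(y))   # or whatever loss you settle on
    loss.backward()
    opt.step()
    print(loss.data[0])              # loss.item() on newer PyTorch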

About the embedding layer: I thought you were passing categorical variables into it, and I meant that those categories should be converted into integer category codes.

Good tip from @groverpr. Since you can run the net, the problem must be the loss function. Try passing the result of your m(…) call to your loss function.

Yup, that is where it is blowing up:

b = next(iter(md.val_dl))
p = m(V(b[0]))

F.nll_loss(p, V(b[1]))  <---

I’m trying to predict either 1 or 0, and the sizes are not correct:

b[1].size(), b[0].size(), p.size(), tt_LABEL.vocab.itos

(torch.Size([64]),
 torch.Size([100, 64]),
 torch.Size([64, 2]),
 ['<unk>', '0', '1'])
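
For reference, F.nll_loss expects log-probabilities of shape [batch, n_classes] plus a target of integer class indices in 0..n_classes-1 with shape [batch]; a target outside that range is what trips the device-side assert on the GPU. Running the same check on the CPU gives a much more readable error (a quick sanity check with made-up numbers):

import torch
import torch.nn.functional as F

scores = torch.randn(4, 2)              # pretend outputs for 2 classes
good = torch.LongTensor([0, 1, 1, 0])   # indices in 0..1 -> fine
bad = torch.LongTensor([0, 1, 2, 0])    # the 2 is out of range for 2 classes

F.nll_loss(F.log_softmax(scores, dim=1), good)    # works
# F.nll_loss(F.log_softmax(scores, dim=1), bad)   # readable out-of-range error on CPU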

So it seems that it is an error in my data preparation, but I’m not sure how to resolve this in my custom dataset class which looks like this:

Does the lbl variable in __init__ have to be one-hot encoded?

I’m more confused now that I’ve gone back and looked at how the custom dataset is being built in the lang_model-arxiv notebook. There isn’t any OHE of the target variable and yet it works.

My model seems to output what I expect … two numbers, one for “0” and the other for “1”.

The problem looks like it’s the dimensions of my “label”, which come back for each batch as torch.Size([64]) instead of torch.Size([64, 2]).

But why? Why does the lang_model-arxiv notebook work without OHE even though it is solving the same type of task I am?

PyTorch’s cross entropy loss et al. don’t expect a one-hot encoded target. We only one-hot encode the target if there are multiple labels per row (in which case we can’t use cross entropy loss anyway).
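
In other words, the target is just a vector of class indices, and cross_entropy is log_softmax + nll_loss rolled into one (a tiny illustration with made-up numbers):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)                 # raw scores for 3 classes
targets = torch.LongTensor([2, 0, 1, 2])   # class indices, not one-hot

loss_a = F.cross_entropy(logits, targets)
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)   # equivalent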

So what am I doing wrong here? :slight_smile:

Am I using the wrong loss function here? My model will actually return 3 outputs (<unk>, 0, 1) but my target will be a single value (e.g., 0 or 1). I thought I could use F.nll_loss, but I get the weird CUDA exception because of the size mismatch (64x3 vs. 64).

Actually … hold everything.

Switched to using F.cross_entropy and everything is running.

I’m glad but … why does this work?

It seems like the targets would need to be one-hot encoded to represent one of the 3 classes (<unk>, 0, 1) … but they don’t need to be. I don’t get that.

The jigsaw dataset is a multi-label dataset. So you need to use binary cross entropy, and a one-hot encoded target. So even if it’s working, I’m not sure you’ll get great results, since some rows have multiple "1"s.
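
For the full multi-label version, that would look something like the sketch below (made-up shapes; the model outputs one raw score per label and there is no softmax across labels):

import torch
import torch.nn.functional as F

n_labels = 6                          # however many label columns you have
logits = torch.randn(4, n_labels)     # raw scores, one per label
targets = torch.FloatTensor([[1, 0, 0, 1, 0, 0],
                             [0, 0, 0, 0, 0, 0],
                             [1, 1, 1, 0, 0, 1],
                             [0, 1, 0, 0, 0, 0]])   # multi-hot: several 1s per row allowed

loss = F.binary_cross_entropy_with_logits(logits, targets)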

That is why I was thinking about creating a classifier for each class. I was starting with the “toxic” class just to see if I could build a working model.

Ah, that makes sense. Well then yes, categorical x-ent is what you want, since you’ve got a categorical dependent variable encoded as an int and pre-softmax predictions.

First, thanks to @groverpr and @jeremy for looking at this and providing some key insights that helped me figure out where I went wrong! The recommendation to grab a batch, run it through the model, and then run the model output through the loss function by hand really helped.

In my case, it had to do with a misunderstanding of how/where the categoricals were encoded.

I couldn’t for the life of me figure out where “0” and “1” were being encoded as “1” and “2” based on my label field’s values of (“<unk>”, “0”, “1”) … until I ran this:

train_iter = data.BucketIterator(splits[0], batch_size=32, device=0, sort_key=lambda ex: len(ex.text))
vars(next(iter(train_iter)))

Numericalizing the labels through the field’s vocab (which is what happens when BucketIterator builds each batch) is what maps “0” and “1” to 1 and 2, and the reason you don’t need to OHE anything. I had no idea.

Once I figured this out, it made sense why I needed to change my output size from 2 to 3 (i.e., md.c).
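
Concretely, the label vocab from above gives this mapping (which is where the third class comes from):

tt_LABEL.vocab.itos   # ['<unk>', '0', '1']
tt_LABEL.vocab.stoi   # {'<unk>': 0, '0': 1, '1': 2}, so out_sz / md.c needs to be 3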

What is the right format for one hot encoding the target for a multi-label classification problem?

I have this:

lbl = list(row[lbl_cols].values)

… which basically sets lbl to something like [1, 0, 0, 0, 0, 0, 0]

Everything works fine until I actually try to run my model using the fast.ai fit() method. I get the following exception:

int() argument must be a string, a bytes-like object or a number, not 'list'

I’m pretty sure it’s a problem with how I’m trying to set up my labels to be OHE, but I’m not sure. Here is what I think is the relevant trace:

Have a look at the planet notebook to see how multi-label classification works.

I looked at it and I think I’m setting my labels correctly as a numpy array of type float32 (it matches what I see in the lesson 2 planet notebook at least).
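
i.e. something along these lines (just a sketch of what I mean; lbl_cols is the list of label column names):

import numpy as np

lbl = row[lbl_cols].values.astype(np.float32)   # e.g. array([1., 0., 0., 0., 0., 0., 0.], dtype=float32)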

However, now I receive this error when I try to iterate over a single mini-batch:

x,y = next(iter(md.val_dl))
>>> only length-1 arrays can be converted to Python scalars

My label torchtext field is defined as such:

tt_LABEL = data.Field(sequential=False, use_vocab=False)

I think I’m missing how to define the torchtext field and/or assign its values for a multi-label problem … but I’m not sure what to try next.

Any ideas? I’m trying to reuse as much as I can from the code in nlp.py, but maybe it (or torchtext Field objects) has some limitations re: multi-label problems.
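
One untested guess for what to try next: telling the label Field to build float tensors instead of the default LongTensor, so a multi-hot numpy array can pass straight through numericalization (the kwarg was tensor_type in the torchtext release from back then; newer releases renamed it to dtype):

tt_LABEL = data.Field(sequential=False, use_vocab=False,
                      tensor_type=torch.FloatTensor)   # untested guess; dtype=torch.float on newer torchtext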

Thanks.