Cuda runtime error (59)

Fighting with this for the last hour …

The error message:
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/generated/../generic/THCTensorMathReduce.cu:18

Working on the toxic comments competition and trying to define a very simple LSTM to predict just one of the categories (“toxic”). I had to define a custom dataset class and also a modified version of TextData, because the fast.ai one calls build_vocab for both the label and text fields, which I don’t want. I can post all that code (if anyone is interested), but I think the problem has something to do with my simple LSTM below.

I’ve defined my text and label fields as follows:

tt_TEXT = data.Field(sequential=True, tokenize=tokenizer, fix_length=100)
tt_LABEL = data.Field(sequential=False, use_vocab=False)

splits = ToxicDataset.splits(tt_TEXT, tt_LABEL, train_df, 'comment_text', 'toxic', val_df, None)
tt_TEXT.build_vocab(splits[0], max_size=20000)
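
For anyone interested, the custom dataset is roughly along these lines. This is a stripped-down sketch built on torchtext’s data.Example.fromlist / data.Dataset, just to show the shape of the splits call above, not my exact class:

class ToxicDataset(data.Dataset):
    # sketch: wrap a DataFrame so torchtext can iterate (text, label) examples
    def __init__(self, df, text_field, label_field, text_col, label_col, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        examples = [data.Example.fromlist([row[text_col], row[label_col]], fields)
                    for _, row in df.iterrows()]
        super().__init__(examples, fields, **kwargs)

    @classmethod
    def splits(cls, text_field, label_field, train_df, text_col, label_col, val_df=None, test_df=None):
        return tuple(cls(df, text_field, label_field, text_col, label_col)
                     for df in (train_df, val_df, test_df) if df is not None)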

My batches look like this:

b = next(iter(md.trn_dl))
b[0].size(), b[1].size()

Returns: (torch.Size([100, 64]), torch.Size([64]))

Here is my model:

class LstmClassifier(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl=1, out_sz=1):
        super().__init__()

        self.vocab_size, self.nl, self.out_sz = vocab_size, nl, out_sz

        self.e = nn.Embedding(vocab_size, n_fac)
        # n_hidden is a notebook-level global, as in the fast.ai lesson notebooks
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5)
        self.l_out = nn.Linear(n_hidden, out_sz)

        self.h = self.init_hidden(bs)

    def forward(self, words):
        # words: [seq_len, bs]; classify from the last timestep's output
        outp, h = self.rnn(self.e(words), self.h)
        self.h = repackage_var(h)   # detach the hidden state between batches
        preds = self.l_out(outp[-1])
        return F.log_softmax(preds)   # newer PyTorch wants an explicit dim=-1 here

    def init_hidden(self, bs):
        return (V(torch.zeros(self.nl, bs, n_hidden)), V(torch.zeros(self.nl, bs, n_hidden)))

m = LstmClassifier(md.nt, n_fac, bsz, 2, 1).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

Here is where I hit the error:
fit(m, md, 6, opt, F.nll_loss)

For anyone bold enough to have read this far, any ideas on what I may have pooched?

Here is where it is blowing up …

I am also a PyTorch noob (although learning and improving every day), but I can try to help because I had a similar error yesterday while building a customized model.

• I think sometimes it goes away when we restart the kernel (I don’t know why).

• Another thing that might be happening is a discrepancy between the indices and sizes of different layers. I think doing b = next(iter(md.trn_dl)); m(V(b)) and then looking at the layers might help. But try restarting the kernel first, and remember to do m.cuda() before fitting the model on the GPU.

Also be sure the input to the embedding layer consists of contiguous integers (0 to n-1). For categorical features you could do something like:

for c in cats: data[c].replace({val: i for i, val in enumerate(data[c].unique())}, inplace=True)
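
The reason this matters: nn.Embedding(num_embeddings, dim) just indexes into a [num_embeddings, dim] weight matrix, so any index outside 0..num_embeddings-1 trips exactly this kind of device-side assert on the GPU (on the CPU you get a readable out-of-range error instead). A tiny illustration with made-up sizes:

import torch
import torch.nn as nn

emb = nn.Embedding(5, 3)             # valid indices are 0..4
ok = emb(torch.LongTensor([0, 4]))   # fine (wrap in V(...)/Variable on PyTorch 0.3)
# bad = emb(torch.LongTensor([5]))   # out of range -> assert on GPU, readable error on CPU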

  1. Restarting didn’t fix anything.

  2. Running m(V(b)) did return this interesting error (I’ve seen it before but I’m not sure what’s causing it):

Actually I forgot that the batch is a tuple.

When I ran m(V(b[0])) it worked just fine.

I reviewed the indexes from my tt_TEXT.vocab and they are actually all contiguous so I don’t think the embedding layer is the problem :frowning:

If the model runs for one minibatch (i.e. m(V(b[0])) works), it should work when called in fit too. Maybe write your own training loop to narrow it down.
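
Something minimal along these lines, a rough sketch assuming md.trn_dl yields (text, label) batches and m, opt are defined as above:

for x, y in md.trn_dl:
    opt.zero_grad()
    preds = m(V(x))
    loss = F.nll_loss(preds, V(y))   # or whatever loss you settle on
    loss.backward()
    opt.step()
    print(loss.data[0])              # loss.item() on newer PyTorch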

About the embedding layer: I thought you were passing categorical variables into it, and I meant that those categories should be converted into integer category codes.

Good tip from @groverpr. Since you can run the net, the problem must be the loss function. Try passing the result of your m(…) call to your loss function.

Yup, that is where it is blowing up:

b = next(iter(md.val_dl))
p = m(V(b[0]))

F.nll_loss(p, V(b[1]))  <---

I’m trying to predict either 1 or 0, and the sizes are not correct:

b[1].size(), b[0].size(), p.size(), tt_LABEL.vocab.itos

(torch.Size([64]),
 torch.Size([100, 64]),
 torch.Size([64, 2]),
 ['<unk>', '0', '1'])
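
For reference, F.nll_loss expects log-probabilities of shape [batch, n_classes] plus a target of integer class indices in 0..n_classes-1 with shape [batch]; a target outside that range is what trips the device-side assert on the GPU. Running the same check on the CPU gives a much more readable error (a quick sanity check with made-up numbers):

import torch
import torch.nn.functional as F

scores = torch.randn(4, 2)              # pretend outputs for 2 classes
good = torch.LongTensor([0, 1, 1, 0])   # indices in 0..1 -> fine
bad = torch.LongTensor([0, 1, 2, 0])    # the 2 is out of range for 2 classes

F.nll_loss(F.log_softmax(scores, dim=1), good)    # works
# F.nll_loss(F.log_softmax(scores, dim=1), bad)   # readable out-of-range error on CPU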

So it seems that it is an error in my data preparation, but I’m not sure how to resolve this in my custom dataset class which looks like this:

Does the lbl variable in __init__ have to be one-hot encoded?

I’m more confused now that I’ve gone back and looked at how the custom dataset is being built in the lang_model-arxiv notebook. There isn’t any OHE of the target variable and yet it works.

My model seems to output what I expect … two numbers, one for “0” and the other for “1”.

The problem looks like it’s the dimensions of my “label”, which come back for each batch as torch.Size([64]) instead of torch.Size([64, 2]).

But why? Why does the lang_model-arxiv notebook work without OHE even though it is solving the same type of task I am?

PyTorch’s cross entropy loss et al. don’t expect a one-hot encoded target. We only one-hot encode the target if there are multiple labels per row (in which case we can’t use cross entropy loss anyway).
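
In other words, the target is just a vector of class indices, and cross_entropy is log_softmax + nll_loss rolled into one (a tiny illustration with made-up numbers):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)                 # raw scores for 3 classes
targets = torch.LongTensor([2, 0, 1, 2])   # class indices, not one-hot

loss_a = F.cross_entropy(logits, targets)
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)   # equivalent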

So what am I doing wrong here? :slight_smile:

Am I using the wrong loss function here? My model will actually return 3 outputs (<unk>, 0, 1) but my target will be a single value (e.g., 0 or 1). I thought I could use F.nll_loss, but I get the weird CUDA exception because of the size mismatch (64x3 vs. 64).

Actually … hold everything.

Switched to using F.cross_entropy and everything is running.

I’m glad but … why does this work?

It seems like the targets would need to be one-hot encoded to represent one of the 3 classes (<unk>, 0, 1) … but they don’t need to be. I don’t get that.

The jigsaw dataset is a multi-label dataset. So you need to use binary cross entropy, and a one-hot encoded target. So even if it’s working, I’m not sure you’ll get great results, since some rows have multiple "1"s.
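
For the full multi-label version, that would look something like the sketch below (made-up shapes; the model outputs one raw score per label and there is no softmax across labels):

import torch
import torch.nn.functional as F

n_labels = 6                          # however many label columns you have
logits = torch.randn(4, n_labels)     # raw scores, one per label
targets = torch.FloatTensor([[1, 0, 0, 1, 0, 0],
                             [0, 0, 0, 0, 0, 0],
                             [1, 1, 1, 0, 0, 1],
                             [0, 1, 0, 0, 0, 0]])   # multi-hot: several 1s per row allowed

loss = F.binary_cross_entropy_with_logits(logits, targets)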

That is why I was thinking about creating a classifier for each class. I was starting with the “toxic” class just to see if I could build a working model.

Ah, that makes sense. Well then yes, categorical x-ent is what you want, since you’ve got a categorical dependent variable encoded as an int and pre-softmax predictions.

First, thanks to @groverpr and @jeremy for looking at this and providing some key insights that helped me figure out where I went wrong! The recommendation to grab a batch, run it through the model, and then run the model output through the loss function by hand really helped.

In my case, it had to do with a misunderstanding of how/where the categoricals were encoded.

I couldn’t for the life of me figure out where “0” and “1” were being encoded as “1” and “2” based on my label field’s values of (“<unk>”, “0”, “1”) … until I ran this:

train_iter = data.BucketIterator(splits[0], batch_size=32, device=0, sort_key=lambda ex: len(ex.text))
vars(next(iter(train_iter)))

Numericalizing the labels through the field’s vocab (which is what happens when BucketIterator builds each batch) is what maps “0” and “1” to 1 and 2, and the reason you don’t need to OHE anything. I had no idea.

Once I figured this out, it made sense why I needed to change my output size from 2 to 3 (i.e., md.c).
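
Concretely, the label vocab from above gives this mapping (which is where the third class comes from):

tt_LABEL.vocab.itos   # ['<unk>', '0', '1']
tt_LABEL.vocab.stoi   # {'<unk>': 0, '0': 1, '1': 2}, so out_sz / md.c needs to be 3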

What is the right format for one hot encoding the target for a multi-label classification problem?

I have this:

lbl = list(row[lbl_cols].values)

… which basically sets lbl to something like [1, 0, 0, 0, 0, 0, 0]

Everything works fine until I actually try to run my model using the fast.ai fit() method. I get the following exception:

int() argument must be a string, a bytes-like object or a number, not 'list'

I’m pretty sure it’s a problem with how I’m trying to set up my labels to be OHE, but I’m not sure. Here is what I think is the relevant trace:

Have a look at the planet notebook to see how multi-label classification works.

I looked at it and I think I’m setting my labels correctly as a numpy array of type float32 (it matches what I see in the lesson 2 planet notebook at least).
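
i.e. something along these lines (just a sketch of what I mean; lbl_cols is the list of label column names):

import numpy as np

lbl = row[lbl_cols].values.astype(np.float32)   # e.g. array([1., 0., 0., 0., 0., 0., 0.], dtype=float32)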

However, now I receive this error when I try to iterate over a single mini-batch:

x,y = next(iter(md.val_dl))
>>> only length-1 arrays can be converted to Python scalars

My label torchtext field is defined as such:

tt_LABEL = data.Field(sequential=False, use_vocab=False)

I think I’m missing how to define the torchtext field and/or assign its values for a multi-label problem … but I’m not sure what to try next.

Any ideas? I’m trying to reuse as much as I can from the code in nlp.py, but maybe it (or torchtext Field objects) has some limitations re: multi-label problems.
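
One untested guess for what to try next: telling the label Field to build float tensors instead of the default LongTensor, so a multi-hot numpy array can pass straight through numericalization (the kwarg was tensor_type in the torchtext release from back then; newer releases renamed it to dtype):

tt_LABEL = data.Field(sequential=False, use_vocab=False,
                      tensor_type=torch.FloatTensor)   # untested guess; dtype=torch.float on newer torchtext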

Thanks.