Structured Learner

This is looking great @kcturgutlu! Let me know when you’ve got something more polished, since I’d love to be able to share this work widely :slight_smile: FYI, your nb link above is a 404. The correct link seems to be https://github.com/KeremTurgutlu/deeplearning/blob/master/avazu/FAST.AI%20Binary%20Classification%20-%20Kaggle%20Avazu%20CTR.ipynb

2 Likes

Thanks for the reminder, I’ve changed the link. I am working on DSBOWL 2018 and USCF simultaneously right now since the task is very similar :) But I will probably be able to optimize and polish the work as you recommend in a couple of days and let you know. Thank you so much!

SIDE NOTE: I didn’t realize how computationally expensive encoder-decoder CNNs are until I actually ran one :slight_smile:

1 Like

Thanks @kcturgutlu!! I’ll definitely try it!

I finally had time to update the notebook; here is the link: https://github.com/KeremTurgutlu/deeplearning/blob/master/avazu/FAST.AI%20Classification%20-%20Kaggle%20Avazu%20CTR.ipynb. Sorry for the late reply :slight_smile:

6 Likes

Thank you! Could you tell me how to send class weights to the loss function?

I tried the following after reviewing the documentation, with no success. I don’t think passing input and target values directly is possible/useful here:

----> 5 learn.crit = F.cross_entropy(weight=[.1,.99])
      6 learn.crit

TypeError: cross_entropy() missing 2 required positional arguments: 'input' and 'target'

Can you point me towards the right place to set the weights to overcome class imbalance issues?
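For reference, F.cross_entropy is the functional form and expects to be called with input and target, which is why the direct call above fails. One way class weights are commonly attached is to bind them ahead of time, e.g. with functools.partial. This is only a minimal sketch, assuming the learn object from the notebook, a binary target, and a GPU (drop .cuda() otherwise); the weight values are just the ones from the question:

from functools import partial
import torch
import torch.nn.functional as F

# Class weights need to be a tensor (one weight per class), not a plain Python list.
weights = torch.FloatTensor([0.1, 0.99]).cuda()

# Bind the weight argument so the criterion can still be called as crit(input, target).
learn.crit = partial(F.cross_entropy, weight=weights)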

1 Like

I’m getting RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCCachingHostAllocator.cpp:258

I’ve been struggling with this for quite a while now while running learn.lr_find(). Could you please help?

Looking at the Jupyter logs shows:
block: [0,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.

in forward(self, x_cat, x_cont)
26 if self.n_emb != 0:
27 x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
—> 28 x = torch.cat(x, 1)
29 x = self.emb_drop(x)
30 if self.n_cont != 0:
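
That particular assertion (srcIndex < srcSelectDimSize) during the embedding lookup usually means one of the categorical codes is outside the range of its embedding table, i.e. some column of x_cat contains a value >= the vocabulary size passed to nn.Embedding. A quick way to check, sketched against the trn_ds and emb_szs names used in this thread (the .cats attribute is an assumption about how the dataset stores its categorical matrix):

import numpy as np

x_cat = trn_ds.cats  # assumed: ColumnarDataset keeps the stacked categorical codes here
for i, (emb_rows, emb_dim) in enumerate(emb_szs):
    max_code = int(np.max(x_cat[:, i]))
    if max_code >= emb_rows:
        print(f'column {i}: max code {max_code} >= embedding rows {emb_rows}')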

2 Likes

Hey, thanks for sharing the link.

I was following your notebook for the classification task and I’m getting this error. Can you please help me figure out what the reason could be?



Do you have x_conts? Can you try to access it through trn_ds and show what you get for x_cont?

No, there isn’t any continuous variable in the data; all of them are categorical. I’m participating in this competition:

Please pull my latest notebook; the problem is with batchnorm, you don’t have the condition if self.n_cont != 0.

Correct Model Class:

class MixedInputModel(nn.Module):
    def __init__(self, emb_szs, n_cont, emb_drop, out_sz, szs, drops,
                 y_range=None, use_bn=False):
        super().__init__()
        self.embs = nn.ModuleList([nn.Embedding(c, s) for c,s in emb_szs])
        for emb in self.embs: emb_init(emb)
        n_emb = sum(e.embedding_dim for e in self.embs)
        self.n_emb, self.n_cont=n_emb, n_cont
        
        szs = [n_emb+n_cont] + szs
        self.lins = nn.ModuleList([
            nn.Linear(szs[i], szs[i+1]) for i in range(len(szs)-1)])
        self.bns = nn.ModuleList([
            nn.BatchNorm1d(sz) for sz in szs[1:]])
        for o in self.lins: kaiming_normal(o.weight.data)
        self.outp = nn.Linear(szs[-1], out_sz)
        kaiming_normal(self.outp.weight.data)

        self.emb_drop = nn.Dropout(emb_drop)
        self.drops = nn.ModuleList([nn.Dropout(drop) for drop in drops])
        self.bn = nn.BatchNorm1d(n_cont)
        self.use_bn,self.y_range = use_bn,y_range

    def forward(self, x_cat, x_cont):
        if self.n_emb != 0:
            x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
            x = torch.cat(x, 1)
            x = self.emb_drop(x)
        if self.n_cont != 0:
            x2 = self.bn(x_cont)
            x = torch.cat([x, x2], 1) if self.n_emb != 0 else x2
        for l,d,b in zip(self.lins, self.drops, self.bns):
            x = F.relu(l(x))
            if self.use_bn: x = b(x)
            x = d(x)
        x = self.outp(x)
        if self.y_range:
            x = F.sigmoid(x)
            x = x*(self.y_range[1] - self.y_range[0])
            x = x+self.y_range[0]
        return x

And let me know how it scores on the LB :wink:

One more thing you can do is use factorization machines and compare them with the embeddings method. Use https://www.csie.ntu.edu.tw/~r01922136/libffm/

4 Likes

What change was made to ColumnarDataset?
I see the one commented line, but it looks the same as the fastai version except that your y input is a df and fastai uses an np.array.
And what change makes it do multi-class classification?
I have a similar setup for a different dataset and get this error:

RuntimeError: multi-target not supported at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THNN/generic/ClassNLLCriterion.c:22

This happens when running learn.fit, which seems to be some dimension error in the target (y) somewhere. The dimension was BatchSize x 1, with each value an int for the category.
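
One common cause of “multi-target not supported” is exactly that BatchSize x 1 shape: PyTorch’s NLL/cross-entropy losses expect the class targets as a 1-D tensor of shape (BatchSize,), so a trailing singleton dimension makes the loss think each sample has multiple targets. A minimal illustration in plain PyTorch (not tied to this notebook):

import torch
import torch.nn.functional as F

logits  = torch.randn(8, 5)              # (batch, n_classes)
targets = torch.randint(0, 5, (8, 1))    # (batch, 1) -- the shape described above

# F.cross_entropy(logits, targets)       # this shape raises the multi-target error
loss = F.cross_entropy(logits, targets.squeeze(1))   # (batch,) integer targets work
print(loss)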

I also wanted to use the categorical data models for classification rather than regression. I got it to work by doing the following:

1. Make sure that the dependent variable is converted to an integer.
2. Change the loss function in the structured learner to self.crit = F.nll_loss.
3. Change the last layer of the mixed model to x = F.log_softmax(x).

The above works with multi-class problems, and hence I prefer it to binary cross entropy. It also avoids having to one-hot encode the dependent variable. (A small stand-alone sketch of the log_softmax/nll_loss pairing is below.)

I would like to make ColumnarDataset, ColumnarModelData, StructuredLearner and MixedInputModel all able to accept either type of input, but I haven’t got around to that yet.
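
For reference, here is a tiny self-contained illustration of the loss/output pairing from steps 2 and 3 above: integer class targets, log_softmax on the model output, and F.nll_loss as the criterion. It uses plain PyTorch tensors rather than the fastai learner, just to show the shapes involved:

import torch
import torch.nn.functional as F

out_sz  = 4                                   # number of classes
logits  = torch.randn(6, out_sz)              # raw output of the last linear layer
targets = torch.tensor([0, 3, 1, 2, 2, 0])    # step 1: dependent variable as integers

log_probs = F.log_softmax(logits, dim=1)      # step 3: last layer returns log-probabilities
loss = F.nll_loss(log_probs, targets)         # step 2: learner criterion
print(loss)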

4 Likes

I’m also running into this problem; however, when I run with @johnri99’s changes as above, I’m getting:
RuntimeError: multi-target not supported at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THCUNN/generic/ClassNLLCriterion.cu:16
as gambit50 was also.

I’ve tried a few different loss functions (including CrossEntropyLoss and MultiLabelSoftMarginLoss) without much success. I keep getting type mismatch errors or these weird RuntimeError: cuda runtime error (59) errors…

Here are my classes:

class StructuredLearner(Learner):
    def __init__(self, data, models, **kwargs):
        super().__init__(data, models, **kwargs)
        if self.models.model.classify:
            self.crit = nn.MultiLabelSoftMarginLoss
        else: self.crit = nn.MultiLabelSoftMarginLoss


class MixedInputModel(nn.Module):
    def __init__(self, emb_szs, n_cont, emb_drop, out_sz, szs, drops, y_range=None, use_bn=False, classify=True):
        super().__init__() ## inherit from nn.Module parent class
        self.embs = nn.ModuleList([nn.Embedding(m, d) for m, d in emb_szs]) ## construct embeddings
        for emb in self.embs: emb_init(emb) ## initialize embedding weights
        n_emb = sum(e.embedding_dim for e in self.embs) ## get embedding dimension needed for 1st layer
        szs = [n_emb+n_cont] + szs ## add input layer to szs
        self.lins = nn.ModuleList([
            nn.Linear(szs[i], szs[i+1]) for i in range(len(szs)-1)]) ## create linear layers input, l1 -> l1, l2 ...
        self.bns = nn.ModuleList([
            nn.BatchNorm1d(sz) for sz in szs[1:]]) ## batchnormalization for hidden layers activations
        for o in self.lins: kaiming_normal(o.weight.data) ## init weights with kaiming normalization
        self.outp = nn.Linear(szs[-1], out_sz) ## create linear from last hidden layer to output
        kaiming_normal(self.outp.weight.data) ## do kaiming initialization
        
        self.emb_drop = nn.Dropout(emb_drop) ## embedding dropout, zeroes out elements of the embedding output
        self.drops = nn.ModuleList([nn.Dropout(drop) for drop in drops]) ## fc layer dropout
        self.bn = nn.BatchNorm1d(n_cont) # batchnorm for continuous data
        self.use_bn,self.y_range = use_bn,y_range 
        self.classify = classify
        
    def forward(self, x_cat, x_cont):
        x = [emb(x_cat[:, i]) for i, emb in enumerate(self.embs)] # takes necessary emb vectors 
        x = torch.cat(x, 1) ## concatenate along axis = 1 (columns - side by side) # this is our input from cats
        x = self.emb_drop(x) ## apply dropout to elements of embedding tensor
        x2 = self.bn(x_cont) ## apply batchnorm to continuous variables
        x = torch.cat([x, x2], 1) ## concatenate cats and conts for final input
        for l, d, b in zip(self.lins, self.drops, self.bns):
            x = F.relu(l(x)) ## dotprod + non-linearity
            if self.use_bn: x = b(x) ## apply batchnorm activations
            x = d(x) 
        x = self.outp(x) 
        return x 

Adapted from: https://github.com/groverpr/deep-learning/blob/master/taxi/taxi3.ipynb

That error message usually occurs when you have one-hot encoded the target, which you don’t need to do with nll_loss.

I’ll have a more thorough look, but that would be my first thought.

1 Like

How are you tackling the imbalance in the dataset?

I was just about to fork the library to incorporate the ability to deal with categorical data when I found that, whilst I had been thinking about it, Vinod Kumar Reddy Gandra had actually gone and done the same. Nice work Vinod; I’m slightly disappointed, as it would have been a good chance to work through contributing to an open-source project, but I’m sure there will be other chances.

It looks as though there is now a parameter to be set when instantiating the ColumnarModelData to tell the system what type of analysis is needed. The parameter ‘is_reg’ should be set to True for regression and False for classification (categorical).
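
The call then looks roughly like the Rossmann-style setup from the lessons, with is_reg=False for classification. This is only a sketch: argument names such as cat_flds and the get_learner sizes follow the lesson notebooks and may differ in your version of column_data.py, and PATH, val_idx, df, y, cat_vars and emb_szs are assumed to be defined as usual.

# Hedged sketch of a classification setup using the is_reg flag (fastai 0.7-era API assumed).
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y.astype('int64'),
                                       cat_flds=cat_vars, bs=128,
                                       is_reg=False)            # False -> classification
learn = md.get_learner(emb_szs, n_cont=len(df.columns) - len(cat_vars),
                       emb_drop=0.04, out_sz=2,                 # out_sz = number of classes
                       szs=[500, 250], drops=[0.05, 0.1])
learn.fit(1e-3, 3)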

5 Likes

I tried passing ‘y’ as both shape (N,) and (N, 1), where N is the number of samples and each value is an integer in range (0, 4) or range (1, 5) [5 classes in my data]. I get the same error in each situation. What am I missing?

range(0,4) should work. What’s the size of your embeddings? Make sure that you’re including the 0 (max(range)+1). If your C isn’t 5 in the embedding then that’s likely your issue.
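
For comparison, the lesson notebooks size each embedding from the category cardinality plus one, so that the 0 code (and any unseen/NA code) has a row; a small sketch, with df and cat_vars assumed to be the usual dataframe and list of categorical column names:

# Cardinality + 1 rows per embedding, as in the Rossmann lesson notebook.
cat_sz  = [(c, len(df[c].cat.categories) + 1) for c in cat_vars]
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]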

Thanks. This is how I decide my embeddings size:
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

I’m not sure what you mean by “If your C isn’t 5 in the embedding then that’s likely your issue”, since isn’t the c for the embedding size different from the number of classes I’m trying to identify? I thought the C in the embedding size is just a function of how many different values that specific categorical variable has.

The model does run without error if I change the loss to mse_loss and the target to np.float32, which is obviously not the best way to do classification. But it does run…

Looking at your example above, you are using MultiLabelSoftMarginLoss. From the PyTorch documentation, this requires a one-hot encoded target of shape (N, C), whereas NLL_Loss takes an (N, C) input of log-probabilities and a plain integer target of shape (N,). Have you tried a simple NLL_Loss function? I have no problem getting this to work using the latest version of column_data.py, which lets you define classification instead of regression and then uses NLL_Loss. The target can be supplied to the model data as a simple integer array.
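
To make the shape difference concrete, here is a small stand-alone comparison in plain PyTorch (the numbers are arbitrary):

import torch
import torch.nn as nn
import torch.nn.functional as F

logits  = torch.randn(4, 5)                   # (N, C) model output for 5 classes
classes = torch.tensor([0, 3, 1, 4])          # NLL_Loss target: shape (N,) of class indices

nll = F.nll_loss(F.log_softmax(logits, dim=1), classes)

# MultiLabelSoftMarginLoss wants a (N, C) multi-hot matrix instead.
one_hot = torch.zeros(4, 5).scatter_(1, classes.unsqueeze(1), 1.0)
ml = nn.MultiLabelSoftMarginLoss()(logits, one_hot)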

Apologies if the example above is out of date; please ignore it if that is the case.