Structured Learner

What change was made to ColumnarDataset?
I see the one commented line, but it looks the same as the fastai version except that your y input is a DataFrame where fastai uses an np.array.
And what change is needed to make it do multi-class classification?
I have a similar setup for a different dataset and get this error:

RuntimeError: multi-target not supported at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THNN/generic/ClassNLLCriterion.c:22

This happens when running learn.fit. It seems to be a dimension error in the target (y) somewhere; the dimension was BatchSize x 1, with each element an int for the category.

I also wanted to use the categorical data models for classification rather than regression. I got it to work by doing the following:

1. Make sure that the dependent variable is converted to integer
2. Change the loss function in the structured learner to self.crit = F.nll_loss
3. Change the last layer of the mixed model to x = F.log_softmax(x)

The above works with multi-class problems, hence I prefer it to binary cross-entropy. It also avoids having to one-hot encode the dependent variable. (A toy check of the recipe is sketched below.)
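A self-contained toy check of that recipe, with plain tensors standing in for the model output (only the shapes matter here):

import torch
import torch.nn.functional as F

# step 1: integer class labels, shape (N,)
y = torch.tensor([0, 2, 1, 1])

# steps 2 + 3: log_softmax on the final linear output, nll_loss as the criterion
out = torch.randn(4, 3)                          # (batch, n_classes)
loss = F.nll_loss(F.log_softmax(out, dim=1), y)
print(loss.item())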

I would like to make ColumnarDataset, ColumnarModelData, StructuredLearner and MixedInputModel all able to accept either type of input, but haven't got around to that yet.

4 Likes

I'm also running into this problem; however, when I run with @johnri99's changes as above, I'm getting the same error gambit50 saw:
RuntimeError: multi-target not supported at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THCUNN/generic/ClassNLLCriterion.cu:16

I've tried a few different loss functions (incl. CrossEntropyLoss, MultiLabelSoftMarginLoss) without much success. I keep getting type mismatch errors or these weird RuntimeError: cuda runtime error (59) errors...

Here are my classes:

class StructuredLearner(Learner):
    def __init__(self, data, models, **kwargs):
        super().__init__(data, models, **kwargs)
        if self.models.model.classify:
            self.crit = nn.MultiLabelSoftMarginLoss()  ## instantiate the loss module (assigning the bare class is a bug)
        else: self.crit = nn.MultiLabelSoftMarginLoss()  ## note: both branches currently set the same loss


class MixedInputModel(nn.Module):
    def __init__(self, emb_szs, n_cont, emb_drop, out_sz, szs, drops, y_range=None, use_bn=False, classify=True):
        super().__init__() ## inherit from nn.Module parent class
        self.embs = nn.ModuleList([nn.Embedding(m, d) for m, d in emb_szs]) ## construct embeddings
        for emb in self.embs: emb_init(emb) ## initialize embedding weights
        n_emb = sum(e.embedding_dim for e in self.embs) ## get embedding dimension needed for 1st layer
        szs = [n_emb+n_cont] + szs ## add input layer to szs
        self.lins = nn.ModuleList([
            nn.Linear(szs[i], szs[i+1]) for i in range(len(szs)-1)]) ## create linear layers: input->l1, l1->l2, ...
        self.bns = nn.ModuleList([
            nn.BatchNorm1d(sz) for sz in szs[1:]]) ## batch normalization for hidden layer activations
        for o in self.lins: kaiming_normal(o.weight.data) ## init weights with kaiming normalization
        self.outp = nn.Linear(szs[-1], out_sz) ## create linear from last hidden layer to output
        kaiming_normal(self.outp.weight.data) ## do kaiming initialization
        
        self.emb_drop = nn.Dropout(emb_drop) ## embedding dropout, will zero out weights of embeddings
        self.drops = nn.ModuleList([nn.Dropout(drop) for drop in drops]) ## fc layer dropout
        self.bn = nn.BatchNorm1d(n_cont) ## batchnorm for continuous data
        self.use_bn,self.y_range = use_bn,y_range 
        self.classify = classify
        
    def forward(self, x_cat, x_cont):
        x = [emb(x_cat[:, i]) for i, emb in enumerate(self.embs)] # takes necessary emb vectors 
        x = torch.cat(x, 1) ## concatenate along axis = 1 (columns - side by side) # this is our input from cats
        x = self.emb_drop(x) ## apply dropout to elements of embedding tensor
        x2 = self.bn(x_cont) ## apply batchnorm to continuous variables
        x = torch.cat([x, x2], 1) ## concatenate cats and conts for final input
        for l, d, b in zip(self.lins, self.drops, self.bns):
            x = F.relu(l(x)) ## dotprod + non-linearity
            if self.use_bn: x = b(x) ## apply batchnorm to activations
            x = d(x) 
        x = self.outp(x) 
        return x 

Adapted from: https://github.com/groverpr/deep-learning/blob/master/taxi/taxi3.ipynb

That error message usually occurs when you have one-hot encoded the target, which you don't need to do with nll_loss.

I'll have a more thorough look, but that would be my first thought.
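A toy illustration of the difference (shapes only, not the fastai classes): nll_loss wants a single column of class indices, and the one-hot form is what triggers the multi-target error.

import torch
import torch.nn.functional as F

log_probs = torch.log_softmax(torch.randn(3, 5), dim=1)  # predictions, shape (N, C)

y_onehot = torch.tensor([[0, 0, 1, 0, 0],
                         [1, 0, 0, 0, 0],
                         [0, 0, 0, 0, 1]])
# F.nll_loss(log_probs, y_onehot)     # -> "multi-target not supported"

y_idx = torch.tensor([2, 0, 4])       # shape (N,), long class indices
loss = F.nll_loss(log_probs, y_idx)   # works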

1 Like

How are you tackling the imbalance in the dataset?

I was just about to fork the library to incorporate the ability to deal with categorical data when I found that, whilst I had been thinking about it, Vinod Kumar Reddy Gandra had actually just done the same. Nice work Vinod. I'm slightly disappointed, as it would have been a good chance to work through contributing to an open source project, but I'm sure there will be other chances.

It looks as though there is now a parameter to be set when instantiating the ColumnarModelData to tell the system what type of analysis is needed: the parameter 'is_reg' should be set to True for regression and False for categorical (classification).

5 Likes

I tried passing 'y' as both shape (N,) and (N, 1), where N is the number of samples and each value is an integer in range (0,4) and range (1,5) [5 classes in my data]. And I get the same error in each situation. What am I missing?

range(0,4) should work. What's the size of your embeddings? Make sure that you're including the 0 (max(range)+1). If your C isn't 5 in the embedding then that's likely your issue.
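On the embedding-size point, a small sketch of the constraint (my own illustration): the first argument to nn.Embedding must be at least max(code) + 1, otherwise lookups fail, and on the GPU that failure surfaces as a device-side assert.

import torch
import torch.nn as nn

emb = nn.Embedding(5, 3)          # covers category codes 0..4
ok = emb(torch.tensor([0, 4]))    # fine
# emb(torch.tensor([5]))          # IndexError on CPU; device-side assert on GPU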

Thanks. This is how I decide my embedding sizes:
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

I'm not sure what you mean by "If your C isn't 5 in the embedding then that's likely your issue", since isn't the c for the embedding size different from the number of classes I'm trying to identify? I thought the c in the embedding size is just a function of how many different categories that specific categorical variable has.

The model does run without error if I change the loss to mse_loss and the target to np.float32, though obviously that is not the best way to do classification. But it does run...

Looking at your example above, you are using MultiLabelSoftMarginLoss. From the PyTorch documentation this requires one-hot encoding of the target, as compared to NLL_Loss, which expects an (N, C) prediction and an (N,) target of class indices. Have you tried a simple NLL_Loss function? I have no problem getting this to work using the latest version of column_data.py, which lets you define classification instead of regression and then uses NLL_Loss. The target can be supplied to the model data as a simple integer array.

Apologies if the example above is out of date; please ignore if that is the case.

Amazing work @kcturgutlu and @johnri99, thanks for sharing your path to success.

What should one do to achieve a multi-label output?
Should setting out_sz suffice?

Another thing: I'm getting exceptions at ClassNLLCriterion.cu even before trying multi-label, with plain multi-class:

ClassNLLCriterion.cu:101: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, 
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [24,0,0] 
Assertion `t >= 0 && t < n_classes` failed.

I'm guessing it has something to do with the output or the loss function?

My code looks like this:

y = df.label.apply(lambda l: int(float(l))) # labels are originally decimal, shape of y is: (49513,)
df.drop('label', axis=1, inplace=True) # shape of df is: (49513, 2298)
val_idx = get_cv_idxs(len(df), val_pct=0.1)
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y.values, cat_flds=[], bs=128, is_reg=False) 
# I have no categorical variables that's why cat_flds=[]
m = md.get_learner(emb_szs=[], n_cont=len(df.columns), emb_drop=0.04, out_sz=1,
                   szs=[1000,500], drops=[0.001,0.01])

Then I'm getting the above exception (ClassNLLCriterion) followed by THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCTensorMath.cu line=15 error=59 : device-side assert triggered.

Do you have any idea what I am doing wrong?
Thanks a lot!

1 Like

Hi Rony,

out_sz = 1 would work with binary cross-entropy, but NLL loss needs an output of 2, i.e. one column for false and one for true. It's redundant information when you only need true or false, but I use it because it's easy to change the number of classes: you don't need to change anything else, since the prediction needs one column per class. This can be confusing because, no matter how many classes your prediction has, the target is a single column of long integers between 0 and n_classes-1.

You can see how I have used it in the example below (in the class ClassifyFromAE).

The training, validation, loss, back propagation etc. are managed by a class called NN_Manage, since this was all written prior to the fastai library.

Not sure this will solve your problem but it looks as though it could be part of the problem.

(note - I checked this with a simple example and I think it is correct)
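To make the shape point concrete, a toy check with plain tensors (my own illustration, not from the post above):

import torch
import torch.nn.functional as F

y = torch.tensor([3, 0, 9, 1])                    # labels for a 10-class problem

bad = torch.randn(4, 1)                           # out_sz=1: only label 0 would be valid
# F.nll_loss(F.log_softmax(bad, dim=1), y)        # fails; on the GPU this is the `t >= 0 && t < n_classes` assert

good = torch.randn(4, 10)                         # out_sz = n_classes
loss = F.nll_loss(F.log_softmax(good, dim=1), y)  # works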

1 Like

Thanks again.
I got it to work about an hour ago, exactly by changing out_sz to 10.
So I'm happy we came to the same conclusion :slight_smile:
I wrote an explanation here if someone is interested.

2 Likes

I am facing the same issue. Were you able to figure it out?

I added a multi-label classification ability to the ColumnarModelData if anyone is interested :slight_smile:
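For reference, the usual multi-label formulation (a generic sketch; I haven't inspected the library change, so treat the choices here as assumptions): one output column per label, multi-hot float targets, and a sigmoid-based loss.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)                        # one output column per label (out_sz = 5)
targets = torch.tensor([[1., 0., 1., 0., 0.],
                        [0., 1., 0., 0., 1.],
                        [0., 0., 0., 1., 0.],
                        [1., 1., 0., 0., 0.]])    # multi-hot, unlike multi-class index targets
loss = F.binary_cross_entropy_with_logits(logits, targets)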

5 Likes

This seems like the most appropriate place for a more general question, because I think it would mostly apply to problems that involve structured datasets. Does anyone have any thoughts on how to incorporate observation weights into PyTorch? By observation weights I mean that one observation might have been observed for a longer period of time than another; the first observation therefore carries more information and should inform the training more. For instance, if the goal of the model is to predict whether or not a car accident occurred and we have different observation lengths for each record, I want to inform the model of this fact.

My first guess about how to incorporate observation weights is to scale each observation's loss contribution by its weight, so that when the loss gets propagated back, the parameter updates are 'weight-aware'. But I really don't have any idea how to do that in PyTorch without breaking everything. I've looked around the internet and I haven't really seen this question asked or addressed. Any thoughts?
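One way to sketch this (my own suggestion, not something from the thread): compute the per-sample loss with reduction disabled, multiply by the weights, then reduce. In current PyTorch that looks roughly like:

import torch
import torch.nn.functional as F

preds = torch.randn(6, 3, requires_grad=True)           # (batch, n_classes)
targets = torch.randint(0, 3, (6,))
weights = torch.tensor([2.0, 1.0, 0.5, 1.0, 3.0, 1.0])  # per-observation weights

per_sample = F.cross_entropy(preds, targets, reduction='none')  # shape (batch,)
loss = (per_sample * weights).sum() / weights.sum()             # weighted mean
loss.backward()  # gradients now scale with each observation's weight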

1 Like

@patrick This is perhaps naive, but could you simply make observation_period a feature of each observation, and let the training process decide what influence that should have?

@dangoldner That is not a bad idea at all, but if we know a priori that the probability of an accident scales linearly with observation length, I think it would be better to inform the model of this than to require the model to learn it. Also, the solution you propose is not as general. Consider another use case where a dataset has been downsampled across some dimension. For example, every 5th observation that has a response value of '0' is kept and the other four are discarded. We might do this if the original dataset is large and there's a class imbalance. In order to get the overall average prediction right, we need to inform the model that every record with a response value of '0' is actually representative of five records. The only way I know how to do that is through observation weights.

1 Like

I am able to fix the issue by mapping the categorical variables from 0 to n.
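For anyone hitting the same thing, a minimal sketch of that remapping (hypothetical column name; pandas cat.codes gives contiguous codes starting at 0):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})  # toy data
df['color'] = df['color'].astype('category').cat.codes
# codes are now 0..2, so nn.Embedding(3, d) covers every value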

I've been working on a multi-class structured learner with embeddings for the categorical data, and it seems to train well, with the val_loss steadily dropping, but I am having problems with the predictions. When I call learn.predict, I get an array with the dimensions of test_df x the first hidden layer (2048), rather than test_df x out_sz (48). If I drop all the hidden layers, I get test_df x len(input with all the embeddings) (772).

md = ColumnarModelData.from_data_frames('/tmp', trn_df, val_df, trn_y.astype('int'), val_y.astype('int'), cats, 512, is_reg=False, test_df=test_df)
model = MixedInputModel(emb_szs, len(contins), emb_drop=0, out_sz=48, szs=[2048,1024,512], drops=[0.1]).cuda()
bm = BasicModel(model, 'multiclass_classifier')
learn = StructuredLearner(md, bm)

Can anybody spot what I am doing wrong?

Running Amazon Linux on a p2.