Structured Learner

My explanation is for classification scenarios only.

For classification:

Use out_sz=2, is_reg=False, is_multi=False, and it will use NLL loss
Use out_sz=1, is_reg=False, is_multi=True, and set your target to a FloatTensor; this will work with BCE loss

I was able to train a binary classifier both ways.


I believe that mathematically your architecture+loss function is equivalent to the [2 output + Softmax + nll_loss] architecture+loss function. Could it be your results are better due to randomness?

I think so. I ran multiple iterations with the same architecture for both cases, and they pretty much converged on the same val_loss.

Shouldn’t classification have at least 2 outputs? If that’s true, then this change of mine is wrong and needs to be reverted: https://github.com/fastai/fastai/pull/654. My intent was to check at the model level whether the inputs and model are correctly configured, to avoid all kinds of failures from pytorch. The check was:

if is_reg==False: assert out_sz >= 2, "arg is_reg==False (classification) requires out_sz>=2"

Do you think it’ll still be a good validator if it’s adjusted to:

if is_reg==False and is_multi==False:
    assert out_sz >= 2, "arg is_reg==False/is_multi==False (classification) requires out_sz>=2"

Or should we allow any out_sz, in case there are other situations where it’s needed?

Also, do you have a working notebook that I could play with to check both combinations? Thank you.

I read your conversation with sgugger here, and I think I’ll still stick to what I mentioned earlier, as I tested it myself.

For out_sz=1, the loss would be BCE; the pytorch documentation here says:

Shape:

    Input: (N,∗) where * means any number of additional dimensions
    Target: (N,∗), same shape as the input
    Output: scalar. If reduce is False, then (N, *), same shape as input.

For out_sz=2, the loss would be NLL; the pytorch documentation here says:

Shape:

    Input: (N,C) where C = number of classes, or (N,C,d1,d2,...,dK) with K≥2 in the case of K-dimensional loss.

    Target: (N) where each value is 0≤targets[i]≤C−1, or (N,d1,d2,...,dK) with K≥2 in the case of K-dimensional loss.

    Output: scalar. If reduce is False, then the same size as the target: (N), or (N,d1,d2,...,dK) with K≥2 in the case of K-dimensional loss.

As you can see, the input to the relevant loss can be (N,∗) or (N,C). So for binary classification we can feed the input either way, as long as we use the matching last layer (log softmax or not).
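To make the two combinations concrete, here is a small standalone pytorch sketch (plain nn modules, not the fastai learner; the feature count and batch size are arbitrary):

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(8, 10)                # a batch of 8 samples with 10 features
y = torch.randint(0, 2, (8,))         # binary targets: 0 or 1

# Case 1: out_sz=2, log softmax last layer, nll_loss; target is a LongTensor of shape (N,)
head2 = nn.Linear(10, 2)
logp = F.log_softmax(head2(x), dim=1)             # input shape (N, C) with C=2
loss_nll = F.nll_loss(logp, y.long())             # target shape (N,)

# Case 2: out_sz=1, sigmoid last layer, binary_cross_entropy; target is a FloatTensor of shape (N, 1)
head1 = nn.Linear(10, 1)
prob = torch.sigmoid(head1(x))                                   # input shape (N, 1)
loss_bce = F.binary_cross_entropy(prob, y.float().view(-1, 1))   # target same shape as input

print(loss_nll.item(), loss_bce.item())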

Well, I’m unsure. I did find it hard to figure out what works when, but I think out_sz can still sit without a check, because the error I catch from pytorch clearly points to the same thing. At times pytorch fails badly, so I debug on the CPU with pdb. I think I’ll share a notebook with all the possibilities of regression and classification and leave it to everyone to decide what the best way out is.


I didn’t have a working notebook, but it took a few minutes to create this one; hope it helps.
The data used for the example is also available at the same location. Please share your thoughts.

I found a similar result with my recent experiment. Can you please take a look and check whether the results are only better due to randomness?

Thank you for sharing the code, @PranY

Before looking at the correctness/accuracy let’s first make the code work.

I needed to move the cat_sz and emb_szs creation before proc_df, otherwise the code fails, since proc_df reduces the cat columns to numbers. That’s a minor thing - I just couldn’t use your code out of the box. Perhaps you could update your notebook, in case others decide to experiment with it.
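For reference, this is the ordering I mean - a rough sketch in the usual fastai 0.7 style, where df, cat_vars and the dependent variable name 'dep_var' are placeholders for whatever the notebook uses:

for c in cat_vars: df[c] = df[c].astype('category').cat.as_ordered()
cat_sz = [(c, len(df[c].cat.categories)+1) for c in cat_vars]    # category counts, +1 for unknown/NA
emb_szs = [(c, min(50, (sz+1)//2)) for _, sz in cat_sz]          # rule-of-thumb embedding sizes
df_proc, y, nas, mapper = proc_df(df, 'dep_var', do_scale=True)  # only now are categories mapped to codes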

So now let’s add metrics:

fit(..., metrics=[accuracy])

It works for binary classification with out_sz=2, but fails with out_sz=1.

TypeError: eq received an invalid combination of arguments - got (torch.cuda.FloatTensor), but expected one of:
 * (int value)
      didn't match because some of the arguments have invalid types: (torch.cuda.FloatTensor)
 * (torch.cuda.LongTensor other)
      didn't match because some of the arguments have invalid types: (torch.cuda.FloatTensor)

That’s actually why I added that assert out_sz>1 - I was getting this error on the titanic dataset, and for a long time I failed to see that I had out_sz=1 :frowning: so I thought it’d save someone some hair.

That said, if someone has a way to resolve this then we can look at accuracy next.

The problem is that it wants y to be a long integer, and you gave it float32. But if you switch y to integer then the reverse problem appears - I don’t remember the exact error now, but again a type mismatch.
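One way around it, I think, is a metric that thresholds the sigmoid output and compares floats to floats instead of class indices - a rough, untested sketch:

def accuracy_bin(preds, targs, thresh=0.5):
    # preds: sigmoid outputs, targs: 0./1. float targets; flatten both to avoid shape surprises
    return ((preds.view(-1) > thresh).float() == targs.view(-1).float()).float().mean()

fit(..., metrics=[accuracy_bin])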

Sure, I’ll do that. Thanks for pointing it out.

Ah, I understand your point. I’ll look into it and get back to you as soon as I can. If I find a way out, I’ll simply update the same notebook and let you know.


Hi,

I moved cat_sz and emb_szs before proc_df (thanks for suggesting this), so now it should work out of the box. Although I’ve tried to ignore warnings, it seems I still get warnings during the call to .fit; I will look into it later. The notebook should work fine end-to-end.

I have discussed the metrics part in detail; I hope I was able to convey the message. Please let me know if that helps, and feel free to drop in any questions you may have. :slight_smile:

The notebook is available in the same location here.


Hmm, no matter what I do, my NN with the structured learner always seems to get stuck at the same metric while the loss is decreasing. The prediction then always generates the same output for all data points. This happens with additional data (other embeddings) and with different numbers of hidden layers.

I do not get any error message and I’m a little clueless because I don’t have an error to debug. :wink:

Has somebody encountered this strange behavior before?


Try using .cpu() on your MixedInputModel, add a pdb.set_trace() in the accuracy definition, and follow it step by step. I think that should help. I’ll be glad to help if you can share the notebook you are using.
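Something along these lines - just a sketch, assuming the whole pipeline runs on the CPU and using accuracy for a 2-output model as the example:

import pdb

m.model.cpu()                          # move the MixedInputModel off the GPU so pdb inspection is easier

def accuracy_debug(preds, targs):
    pdb.set_trace()                    # step through and inspect the shapes, dtypes and values of preds/targs
    return (preds.max(1)[1] == targs).float().mean()

m.fit(lr, 1, metrics=[accuracy_debug])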


Thank you for updating the notebook and putting detailed steps to explain what you were presenting, @PranY

When I run it I get multiple warnings on every batch of the fit/lr_find runs:

[...]python3.6/site-packages/torch/nn/functional.py:1189: UserWarning: Using a target size (torch.Size([512])) that is different to the input size (torch.Size([512, 1])) is deprecated. Please ensure they have the same size.
  "Please ensure they have the same size.".format(target.size(), input.size()))

You can get rid of those warnings with y = y.reshape(len(y),1) before passing it to the learner; that should work nicely. I haven’t tried it yet, but that’s what the warning clearly points to.

Not sure if I’m up to date, but I found several modifications of the fast.ai structured learner for a binary response in this topic. Which of them is now definitely working? :slight_smile:

Thanks,

KS

See @dtylor’s notebook Structured Learner For Kaggle Titanic - you don’t need to change anything in the structured learner classes in the current code base.

Yes, your breakdown is great, @PranY. We need to ask someone experienced whether that is_multi-style approach to binary classification is a good one. Meanwhile I will try it with a real dataset - it’s harder to compare performance with random numbers :wink:

In your notebook, you write:

I’m still not sure how to use accuracy_multi if I don’t set the default threshold in the function definition.
I also can’t pass just a threshold during fit, as it would require preds and targs.

This is how you do it:

def accuracy_multi_thresh(thresh): 
    return lambda preds,targs: accuracy_multi(preds, targs, thresh)

m.fit(1e-5, n_cycle=2, cycle_len=1, cycle_mult=2, metrics=[accuracy_multi_thresh(0.6)]) 

It’s already in the library, except it’s called accuracy_thresh:

m.fit(1e-5, n_cycle=2, cycle_len=1, cycle_mult=2, metrics=[accuracy_thresh(0.6)])

OK, I finally had a chance to apply your alternative approach to binary classification to a real dataset, @PranY.

I have a very basic titanic kaggle notebook.

I used a fixed seed to make sure I’m comparing apples to apples:

torch.manual_seed(40)
random.seed(40)
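(numpy and CUDA have their own RNGs too, so for a stricter apples-to-apples comparison one could also seed those - I haven’t checked how much it changes things here:)

np.random.seed(40)
torch.cuda.manual_seed_all(40)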

Once the data is ready, I run it through 2 different approaches, with the only differences being a few arguments, listed in each entry’s title:

Approach 1: is_reg=False, is_multi=False, out_sz=2 (crit: nll_loss)

md = ColumnarModelData.from_data_frame(PATH, valid_idx, train_proc_df, y.astype('int64'), cat_flds=cat_vars, bs=32, 
                                       is_reg=False, is_multi=False, test_df=test_proc_df)
m = md.get_learner(emb_szs=emb_szs, n_cont=(len(train_proc_df.columns)-len(cat_vars)), emb_drop=0.04, out_sz=2, 
                   szs=[1000,500], drops=[0.001,0.01], y_range=y_range, use_bn=False)
lr = 1e-3
m.fit(lr, 1, metrics=[accuracy, f1, precision, recall])
m.fit(lr, 2, cycle_len=2, cycle_mult=3, metrics=[accuracy, f1, precision, recall])
preds = np.argmax(m.predict(True), axis=1)

Approach 2: is_reg=False, is_multi=True, out_sz=1 (crit: binary_cross_entropy)

y = y.reshape(len(y),1)
md = ColumnarModelData.from_data_frame(PATH, valid_idx, train_proc_df, y.astype(np.float32), cat_flds=cat_vars, bs=32,
                                       is_reg=False, is_multi=True, test_df=test_proc_df)
m = md.get_learner(emb_szs=emb_szs, n_cont=(len(train_proc_df.columns)-len(cat_vars)), emb_drop=0.04, out_sz=1, 
                  szs=[1000,500], drops=[0.001,0.01])
lr = 1e-3
m.fit(lr, 1, metrics=[accuracy_thresh(0.5)])
m.fit(lr, 2, cycle_len=2, cycle_mult=3, metrics=[accuracy_thresh(0.5)])
preds2 = m.predict(True)
preds2 = np.concatenate((preds2>0.5)*1)

I had to comment out this line in column_data.py for this to work:

#if is_reg==False: assert out_sz >= 2, "arg is_reg==False (classification) requires out_sz>=2"

Results:

Comparing approach 1 with approach 2:

(preds==preds2).mean()
0.9521531100478469

Pretty close! And submitting both to kaggle, surprisingly both received the same score of 0.77511.

So to me it looks like your approach works similarly to the other method, @PranY.

It’d be a very good idea to try it with another, perhaps much bigger, dataset. Does anybody have a binary classification notebook with a bigger structured dataset that we can try this on - ideally one where we know the correct predictions?

Also, could you please check that my prediction code is correct in the 2nd case?


Thanks for confirming this.

I tried it on an active Kaggle competition and the results are the same on the LB. In fact, the sample data and notebook I shared are from that same competition, but I applied a linear shift to the data to avoid breaking any Kaggle rules.

Yeah, it looks perfect as far as I can tell.

Thanks for this, I somehow overlooked it completely :man_facepalming:

Maybe @jeremy can help us with this. In the meantime, I hope others also try and test out the idea, and maybe we can conclude something or find something totally new.

Can you summarize what you’ve done, and what you want help with, please?