OK, I finally had a chance to apply your alternative approach to binary classification to a real dataset, @PranY.
I have a very basic Titanic Kaggle notebook.
I used a fixed seed to make sure I'm comparing apples to apples:
torch.manual_seed(40)
random.seed(40)
Once the data is ready, I run it through two different approaches; the only differences are the few arguments listed in each approach's title:
Approach 1: is_reg=False, is_multi=False, out_sz=2 (crit: nll_loss)
md = ColumnarModelData.from_data_frame(PATH, valid_idx, train_proc_df, y.astype('int64'), cat_flds=cat_vars, bs=32,
is_reg=False, is_multi=False, test_df=test_proc_df)
m = md.get_learner(emb_szs=emb_szs, n_cont=(len(train_proc_df.columns)-len(cat_vars)), emb_drop=0.04, out_sz=2,
szs=[1000,500], drops=[0.001,0.01], y_range=y_range, use_bn=False)
lr = 1e-3
m.fit(lr, 1, metrics=[accuracy, f1, precision, recall])
m.fit(lr, 2, cycle_len=2, cycle_mult=3, metrics=[accuracy, f1, precision, recall])
preds = np.argmax(m.predict(True), axis=1)
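For reference, here is a minimal sketch (plain NumPy, with hypothetical values) of what that argmax step does with the two-column output that nll_loss pairs with:

```python
import numpy as np

# Hypothetical (n_samples, 2) log-probabilities from a 2-class head
log_probs = np.array([[-0.2, -1.7],   # class 0 more likely
                      [-2.3, -0.1]])  # class 1 more likely

# argmax along axis=1 picks the more likely class per row
preds = np.argmax(log_probs, axis=1)
print(preds)  # [0 1]
```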
Approach 2: is_reg=False, is_multi=True, out_sz=1 (crit: binary_cross_entropy)
y = y.reshape(len(y),1)
md = ColumnarModelData.from_data_frame(PATH, valid_idx, train_proc_df, y.astype(np.float32), cat_flds=cat_vars, bs=32,
is_reg=False, is_multi=True, test_df=test_proc_df)
m = md.get_learner(emb_szs=emb_szs, n_cont=(len(train_proc_df.columns)-len(cat_vars)), emb_drop=0.04, out_sz=1,
szs=[1000,500], drops=[0.001,0.01])
lr = 1e-3
m.fit(lr, 1, metrics=[accuracy_thresh(0.5)])
m.fit(lr, 2, cycle_len=2, cycle_mult=3, metrics=[accuracy_thresh(0.5)])
preds2 = m.predict(True)
preds2 = np.concatenate((preds2>0.5)*1)
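As an aside, that last line can be written a bit more idiomatically. A sketch with hypothetical sigmoid outputs, showing both forms give the same flat integer array:

```python
import numpy as np

# Hypothetical (n_samples, 1) sigmoid outputs from a single-unit head
probs = np.array([[0.9], [0.3], [0.7]])

# Original form: multiply the boolean mask by 1, flatten via concatenate
preds_a = np.concatenate((probs > 0.5) * 1)

# Equivalent and clearer: cast to int, then flatten with ravel
preds_b = (probs > 0.5).astype(int).ravel()

print(preds_a)  # [1 0 1]
assert (preds_a == preds_b).all()
```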
To get this to work, I had to comment out the following assertion in column_data.py:
#if is_reg==False: assert out_sz >= 2, "arg is_reg==False (classification) requires out_sz>=2"
Results:
Comparing approaches 1 with 2:
(preds==preds2).mean()
0.9521531100478469
Pretty close! And when I submitted both to Kaggle, surprisingly they received the same score of 0.77511.
So it looks to me like your approach works similarly to the other method, @PranY.
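The close agreement isn't too surprising: for two classes, a softmax over logits (z0, z1) and a sigmoid over the single logit z1 − z0 define the same decision boundary, so the two heads are in principle learning the same function. A quick numerical check of that identity (plain NumPy, hypothetical logits):

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical 2-class logits (z0, z1) for a few samples
logits = np.array([[0.5, 2.0], [1.2, -0.3], [-1.0, -1.0]])

# P(class 1) via a 2-way softmax equals sigmoid of the logit difference
p_softmax = softmax(logits)[:, 1]
p_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])
assert np.allclose(p_softmax, p_sigmoid)
```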
It'd be a good idea to try it on another, perhaps much bigger, dataset. Does anybody have a binary classification notebook with a bigger structured dataset we could try this on, ideally one where we know the correct predictions?
Also, could you please check that my prediction code in the 2nd case is correct?