Lesson 3 applied to a current Kaggle competition

Context

I am trying to apply the same approach used in the Rossmann dataset to an ongoing competition on Kaggle:

Goal
My goal is to generate high-quality embeddings by applying the approach Jeremy showed us to a different dataset. At the same time, I am trying to learn how far I can push embeddings for higher-cardinality entities.

What I have done
I am predicting a binary target (download yes/no, 1/0) based on 8 categorical features:

I prep the data just like in the Rossmann case:

df, y, nas = proc_df(df=joined, y_fld = 'is_attributed', do_scale=False)
df_test, _, nas = proc_df(joined_test, 'is_attributed', na_dict=nas)

Then find the embedding dimensionality:

cat_sz = [(c, len(joined[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
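To make the sizing rule above concrete, here is a minimal pure-Python sketch. The column names and cardinalities are hypothetical placeholders, not the competition's real values; in the actual notebook they come from joined[c].cat.categories. The helper embedding_size is also just an illustrative name for the min(50, (c+1)//2) expression.

```python
# Hypothetical cardinalities for illustration only; the real values come
# from len(joined[c].cat.categories) for each categorical column.
cardinalities = {"app": 700, "device": 3500, "os": 800, "channel": 200, "hour": 24}


def embedding_size(n_categories, max_size=50):
    """Half the cardinality (rounded up), capped at max_size."""
    return min(max_size, (n_categories + 1) // 2)


# The +1 mirrors the line above (one extra index on top of the known categories).
cat_sz = [(name, n + 1) for name, n in cardinalities.items()]
emb_szs = [(n, embedding_size(n)) for _, n in cat_sz]

for (name, _), (n, width) in zip(cat_sz, emb_szs):
    print(f"{name}: {n} indices -> embedding width {width}")
```

With this rule, every column with more than about 100 categories gets the same width of 50, so for high-cardinality features the cap, rather than the cardinality, decides the embedding size.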

Since I have a binary classifier, I set the output range to [0, 1]:

y_range = (0, 1)

This may not be correct.
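For intuition about what y_range does, here is a small sketch that assumes the library applies it as a scaled sigmoid on the final activation. That is an assumption about the internals, not something confirmed in this thread, and apply_y_range is a hypothetical helper name.

```python
import math


def apply_y_range(raw_output, y_range):
    """Squash an unbounded activation into (low, high) with a scaled sigmoid.

    Assumption: this mirrors how y_range is applied to the model output.
    """
    low, high = y_range
    sigmoid = 1.0 / (1.0 + math.exp(-raw_output))
    return sigmoid * (high - low) + low


y_range = (0, 1)
for raw in (-4.0, 0.0, 4.0):
    print(raw, round(apply_y_range(raw, y_range), 4))
```

Under that assumption, y_range = (0, 1) reduces to a plain sigmoid, so the outputs can be read as scores between 0 and 1 for the positive class.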

I then call get_learner():

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)

And I finally get an error when I try to run:

m.lr_find()

RuntimeError: Expected object of type Variable[torch.cuda.FloatTensor] but found type Variable[torch.cuda.LongTensor] for argument #1 'target'

What else I have tried
I suspect the issue is in how I am encoding the target variable. I changed the call to ColumnarModelData.from_data_frame() to pass y.astype(np.float32), i.e. the target as a float instead of an int. With that change m.lr_find() runs fine, but then I can’t run m.fit(lr, 3, metrics=[roc_auc_score]):

roc_auc_score raises ValueError: continuous format is not supported.
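One possible cause, not confirmed in this thread: scikit-learn's roc_auc_score expects its arguments as (y_true, y_score). If the training loop calls each metric as metric(preds, targs), the continuous predictions land in the y_true slot, which would produce exactly this error. The sketch below shows the idea of a wrapper that takes predictions first and hands the labels and scores on in the right order. It uses a pure-Python pairwise AUC as a stand-in for roc_auc_score so that it runs without dependencies; auc_from_scores and auc_metric are hypothetical names.

```python
def auc_from_scores(y_true, y_score):
    """Pairwise (Mann-Whitney) ROC AUC.

    Fraction of positive/negative pairs where the positive example has the
    higher score; ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_metric(preds, targs):
    """Hypothetical wrapper for a loop that calls metric(preds, targs)."""
    labels = [int(round(t)) for t in targs]  # float 0.0/1.0 -> int 0/1
    return auc_from_scores(labels, preds)    # labels first, scores second


preds = [0.1, 0.4, 0.35, 0.8]
targs = [0.0, 0.0, 1.0, 1.0]
print(auc_metric(preds, targs))
```

If the argument order is indeed the problem, the same wrapper shape should work with the real scorer: a function taking (preds, targs) that returns roc_auc_score(targs, preds), passed in the metrics list in place of roc_auc_score itself.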

I have looked at the source code of ColumnarModelData.from_data_frame in the hope of finding some clue about how to change the target encoding, but had no luck.

Question
How am I supposed to change the code in lesson 3/Rossmann to use tabular data with a binary target?

Thank you

@gballardin
Question:
How am I supposed to change the code in lesson 3/Rossmann to use tabular data with a binary target?

you can refer to

for time series binary classification.

@shwetap7 that does look like the same problem in the abstract, just on a different dataset. It is a bit more involved than I was expecting at first, so there is something new for me to learn on a practical problem.

Thank you!

By the way, the approach from the GitHub repo above worked for me. The only tricky bit is that the author seemed to have already pre-processed the data and did not use proc_df(). That did not work for me: I kept getting a hard-to-decipher CUDA error 59.

Anyway, preprocessing the data with proc_df() made it work like a champ.

Thanks for sharing @shwetap7!