I am trying to apply the same approach used in the Rossmann dataset to an ongoing competition on Kaggle:
My goal is to generate high-quality embeddings by applying the approach that Jeremy showed us to a different dataset. At the same time, I am trying to learn how far I can push the concept of embeddings for higher-cardinality entities.
What I have done
I am predicting a binary target (download yes/no, 1/0) based on 8 categorical features:
I prep the data just like in the Rossmann case:
df, y, nas = proc_df(df=joined, y_fld = 'is_attributed', do_scale=False)
df_test, _, nas = proc_df(joined_test, 'is_attributed', na_dict=nas)
Then find the embedding dimensionality:
cat_sz = [(c, len(joined[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
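To make the sizing rule concrete, here is a minimal sketch in plain Python. The column names and cardinalities are made up for illustration and are not from my data, and embedding_size is a hypothetical helper that simply restates the expression above. As I understand it, the +1 in cat_sz reserves an index for missing or unknown values.

```python
def embedding_size(n_categories, cap=50):
    """Half the cardinality, rounded up, capped at `cap`."""
    return min(cap, (n_categories + 1) // 2)

# Hypothetical (name, number of distinct categories) pairs, NOT the real data.
example_cardinalities = [("weekday", 7), ("hour", 24), ("device_id", 3000)]

# Mirror cat_sz above: add 1 to each cardinality.
cat_sz = [(name, n + 1) for name, n in example_cardinalities]

# Mirror emb_szs above: (cardinality, embedding width) per categorical column.
emb_szs = [(c, embedding_size(c)) for _, c in cat_sz]
print(emb_szs)  # [(8, 4), (25, 13), (3001, 50)]
```

The point of the cap is that a very high-cardinality column such as the hypothetical device_id still gets only 50 dimensions, which is the part I am least sure about for my use case.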
Since this is a binary classifier, I set the output range to [0, 1]:
y_range = (0, 1)
This may not be correct.
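My understanding, which I have not confirmed against the library source, is that y_range squashes the final activation with a sigmoid and rescales it to the given interval. A minimal sketch of that assumption in plain Python (scale_to_range is a hypothetical helper, not a fastai function):

```python
import math

def scale_to_range(raw_output, y_range):
    """Squash an unbounded activation into (low, high) using a sigmoid."""
    low, high = y_range
    sig = 1.0 / (1.0 + math.exp(-raw_output))
    return sig * (high - low) + low

y_range = (0, 1)
# Three arbitrary raw activations: strongly negative, zero, strongly positive.
scaled = [scale_to_range(x, y_range) for x in (-4.0, 0.0, 4.0)]
print(scaled)
```

If that assumption holds, y_range = (0, 1) gives an output that stays strictly between 0 and 1 and could be read as a probability, but it says nothing about which loss function is applied to it.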
I then call
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
And I finally get an error when I try to run:
RuntimeError: Expected object of type Variable[torch.cuda.FloatTensor] but found type Variable[torch.cuda.LongTensor] for argument #1 'target'
What else I have tried
I suspect the issue is in how I am encoding the target variable. I changed the call to ColumnarModelData.from_data_frame() to take y.astype(np.float32), i.e. the target as a float instead of an int. In this case m.lr_find() runs fine, but then I can't run m.fit(lr, 3, metrics=[roc_auc_score]): roc_auc_score complains that
ValueError: continuous format is not supported.
I have looked at the source code of ColumnarModelData.from_data_frame in the hope of finding some clue about how to change the target encoding, but no luck.
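One possibility I have not ruled out concerns argument order rather than target encoding. scikit-learn's roc_auc_score expects (y_true, y_score); if the training loop calls each metric as metric(preds, targs), which is an assumption on my part, the continuous predictions would land in the y_true slot and could trigger exactly this kind of error. The sketch below shows an adapter for that case. swap_metric_args and toy_auc are hypothetical names, and toy_auc is only a stand-in for roc_auc_score so the example runs without scikit-learn.

```python
def swap_metric_args(metric):
    """Wrap metric(y_true, y_score) so it can be called as (preds, targs)."""
    def wrapped(preds, targs):
        return metric(targs, preds)
    wrapped.__name__ = getattr(metric, "__name__", "metric")
    return wrapped

def toy_auc(y_true, y_score):
    """Stand-in AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half. Rejects non-binary labels, loosely imitating the
    behaviour of a real AUC implementation given continuous y_true.
    """
    if any(t not in (0, 1) for t in y_true):
        raise ValueError("y_true must contain only 0/1 labels")
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

preds = [0.1, 0.8, 0.35, 0.9]   # made-up model outputs
targs = [0.0, 1.0, 0.0, 1.0]    # made-up float-encoded labels

# Calling toy_auc(preds, targs) directly would raise ValueError,
# because the predictions would be treated as labels.
auc_for_loop = swap_metric_args(toy_auc)
print(auc_for_loop(preds, targs))  # 1.0
```

If the assumption about the call order is right, passing metrics=[swap_metric_args(roc_auc_score)] to m.fit would be the thing to try; the predictions and targets might also need converting to flat NumPy arrays first.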
How am I supposed to change the code in lesson 3/Rossmann to use tabular data with a binary target?