Kaggle Tabular Multilabel Classification - TabularPandas issue

Hey guys,
I was trying to set up a simple baseline model for the recent Kaggle competition lish-moa. It is a multilabel classification problem, and while setting up my TabularPandas object I came across 2 issues and have a few doubts.

Issue 1

I had a list, dep_vars, containing the names of all the dependent variables. After setting up my TabularPandas object with what I believe are the correct options (cat_names, cont_names and y_names), I checked my to.train.y values, which showed only the first column from the dep_vars list.

to = TabularPandas(df, procs=procs, cat_names=cat, cont_names=cont,
                   y_names=dep_vars, splits=splits, device="cuda")

Issue 2

There is a column sig_id in the data, which I include in neither cat_names nor cont_names, yet it shows up when I run to.items.head(5). Further, when I run fit_one_cycle on a tabular_learner, an error pops up saying RuntimeError: Found dtype Char but expected Float.


  • Since the evaluation metric is log_loss, I’m not sure what loss_func and metric to use. According to the course, it’s either nn.BCELoss() or nn.BCEWithLogitsLoss(), and I’ve seen people use the first. I was confused, though: don’t we have to include the sigmoid function, since log_loss requires the predicted probability?

  • Do the categorical variables have to be of type category before they are used in a TabularPandas object, or does the Categorify proc do that for you?

  • What is the use of the n_out parameter in tabular_learner? Do we have to set it even after passing dep_vars as y_names when defining our TabularPandas object?

Any help would be really appreciated! Thanks!

That is expected in the code: to.y only returns the first of the y_names, while to.targ has the full y_names.
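As a rough illustration of that difference, here is a minimal sketch using a plain pandas DataFrame rather than a real TabularPandas object. It assumes to.y behaves like selecting the first entry of y_names and to.targ like selecting all of them; the column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical miniature of the merged competition dataframe
df = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c"],
    "g-0": [0.1, -0.4, 1.2],
    "target_1": [0, 1, 0],
    "target_2": [1, 0, 0],
})
y_names = ["target_1", "target_2"]

# Analogous to to.y: a single Series holding only the first dependent variable
first_only = df[y_names[0]]

# Analogous to to.targ: a DataFrame holding every dependent variable
all_targets = df[y_names]

print(first_only.name)
print(list(all_targets.columns))
```

So seeing a single column from to.train.y does not by itself mean the other targets were dropped.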

It won’t limit what data you actually see; it will still store the entire dataframe.

For a minimal example, I can leave education out of cat_names in the adult example, and to.items will still show it. However, if I do to.cats.head(), it will not.

If I do to.show() (to show the first 10 rows), it will not show up there either.

Likely your key error is due to another reason unspecified, so we’d need to know more about your declarations.

Categorify takes care of this, so the columns do not need to be of type category beforehand.

n_out sets the size of the last layer (the output layer). It does the same thing for vision models too.

Yes and no. In this competition we need to pair it with a y_range, since we want probabilities between 0 and 1, so we pass in y_range=(0,1).
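To make the relationship between the two candidate losses concrete, here is a minimal sketch in plain Python (no PyTorch needed; the helper names and sample values are made up for illustration). nn.BCELoss expects probabilities, while nn.BCEWithLogitsLoss takes raw scores and applies the sigmoid itself, so the two agree when the first is fed sigmoid outputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probs(p, y):
    # Per-element binary cross-entropy on a probability (what BCELoss expects)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logits(z, y):
    # Same quantity computed directly from the raw score,
    # using the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|))
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

logits = [-2.0, 0.0, 0.5, 3.0]   # hypothetical raw model outputs
targets = [0, 1, 1, 0]           # hypothetical binary labels

loss_probs = sum(bce_from_probs(sigmoid(z), y) for z, y in zip(logits, targets)) / len(logits)
loss_logits = sum(bce_from_logits(z, y) for z, y in zip(logits, targets)) / len(logits)

print(loss_probs, loss_logits)
```

Both numbers are the mean log loss; the only question is whether the sigmoid lives in the model or in the loss.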

Thanks for letting me know about to.targ, where I could see all the target values.

"Likely your key error is due to another reason unspecified, so we’d need to know more about your declarations."

My declarations are as follows:

from fastai.tabular.all import *

dep_vars = list(targ_df.columns)[1:] # first value is 'sig_id'
df = feat_df.merge(targ_df, on='sig_id')

cont, cat = cont_cat_split(df, max_card=100, dep_var=dep_vars)

times = [24, 48, 72]
df['cp_time'] = df['cp_time'].astype('category')
# assign the result back: the inplace argument is no longer available in recent pandas
df['cp_time'] = df['cp_time'].cat.set_categories(times, ordered=True)

procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter(valid_pct=0.1)(range_of(df))

to = TabularPandas(df, procs=procs, cat_names=cat, cont_names=cont, y_names=dep_vars, splits=splits, device="cuda")
dls = to.dataloaders(bs=1024)

learn = tabular_learner(dls, layers=[1024,512,256], n_out=len(dep_vars), y_range=(0,1), loss_func=nn.BCELoss())

I also tried checking dls.one_batch(), which returned 3 tensors; I’m guessing these are the cat, cont and targ values. The output was as follows:

So since none of the dtypes are Char, I’m still unsure what’s causing that error.

I just narrowed down my mistake to using the wrong loss_func, which should be BCEWithLogitsLossFlat(). But if anyone could explain why this would cause the error "RuntimeError: Found dtype Char but expected Float", that would be really helpful!
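One plausible explanation, which is an assumption and not confirmed anywhere in this thread: "Char" is PyTorch's internal name for the int8 dtype, so a batch printed by dls.one_batch() would show it as torch.int8 rather than Char. If the 0/1 target columns were stored as int8, nn.BCELoss receives float predictions and integer targets and does not cast them itself. The sketch below reproduces that situation with made-up tensors; the exact error text may differ between PyTorch versions.

```python
import torch
from torch import nn

# Hypothetical float32 predictions in (0, 1), as a model with y_range=(0,1) would emit
probs = torch.tensor([[0.9, 0.2], [0.1, 0.7]])

# Hypothetical targets stored as int8, which PyTorch calls "Char" internally
targ_int8 = torch.tensor([[1, 0], [0, 1]], dtype=torch.int8)

loss_fn = nn.BCELoss()

# Passing integer targets is expected to fail with a dtype complaint
try:
    loss_fn(probs, targ_int8)
    error_message = None
except RuntimeError as e:
    error_message = str(e)
print(error_message)

# Casting the targets to float gives the loss what it expects
loss = loss_fn(probs, targ_int8.float())
print(loss.item())
```

If this is what happened, it would also explain why switching to BCEWithLogitsLossFlat() made the error go away, assuming that wrapper converts the targets to float before computing the loss.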

I had some issues with BCEWithLogits as well. In the end, it turned out that the problem was using y_range in the learner together with a loss that applies its own sigmoid, which doesn’t make much sense.
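A small sketch of why that combination is a problem, in plain Python. It assumes y_range=(0,1) squashes the final layer with a sigmoid scaled to that range; the sigmoid_range helper here is a local stand-in written for this example, not an import from fastai. A logits-based loss then applies a second sigmoid to values that are already probabilities:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_range(z, low, high):
    # Local stand-in: a sigmoid rescaled to the interval (low, high)
    return sigmoid(z) * (high - low) + low

raw_scores = [-10.0, -2.0, 0.0, 2.0, 10.0]   # hypothetical final-layer activations

# What the model emits when y_range=(0, 1): already a probability
once = [sigmoid_range(z, 0, 1) for z in raw_scores]

# What a "with logits" loss would then treat as the probability
twice = [sigmoid(p) for p in once]

for z, p1, p2 in zip(raw_scores, once, twice):
    print(f"{z:6.1f}  {p1:.4f}  {p2:.4f}")
```

After the second sigmoid every value is confined to roughly 0.5 to 0.73, so the loss can never see a confident prediction in either direction. Under that assumption, the consistent pairings are y_range=(0,1) with a loss that expects probabilities, or no y_range with a loss that expects logits.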