Ooo, I had the same problem last night. My dependent variable was a column of floats even though it was binary. Adding the dep-var to cat_names or leaving it continuous didn't change anything: the DataBunch's task type became Regression and data.c was None.
Changing the dtype of the dataframe's dep-var column to np.int64 got it treated as classification, with data.c equal to 2.
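In case it helps anyone, here's a minimal sketch of that fix (the dataframe and column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')   # hypothetical dataframe
dep_var = 'target'             # hypothetical binary dep-var stored as floats (0.0 / 1.0)

# cast the float column to int64 so fastai infers a classification task
df[dep_var] = df[dep_var].astype(np.int64)
```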
I have an aside question on tabular: has anyone seen test-set accuracy drop off a cliff when removing the dependent variable from a DataBunch’s test set? Like 99% → 69%.
That is to say: passing a test dataframe that includes the labels column vs. holding that column out as an array.
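Concretely, the two setups I mean (a sketch; df_test and dep_var are made-up names):

```python
# setup A: test dataframe still contains the labels column
test_a = df_test

# setup B: labels held out as a separate array
y_test = df_test[dep_var].values
test_b = df_test.drop(columns=[dep_var])
```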
update:
Kind-of a duhh moment: it looks like the indices of the 0/1 classes are just encoded in (I'm guessing) descending alphabetical or numeric order.
In other words:
learn.data.train_ds.class2idx gives {1: 0, 0: 1}, so a ‘1’ becomes index 0 and a ‘0’ becomes index 1.
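One practical consequence (a sketch, assuming a fastai v1 learner with a test set already attached): the prediction columns from get_preds follow that mapping, so indices may need translating back to labels.

```python
from fastai.basic_data import DatasetType

# probabilities come back in class-index order, i.e. column 0 is
# the probability of class '1' here, given the {1: 0, 0: 1} mapping
preds, _ = learn.get_preds(ds_type=DatasetType.Test)
pred_idx = preds.argmax(dim=1)

# map indices back to the original labels via the learned mapping
idx2class = {v: k for k, v in learn.data.train_ds.class2idx.items()}
pred_labels = [idx2class[int(i)] for i in pred_idx]
```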
update 2:
Never mind. Despite this being the case, if the dependent variable's column is present in the test set, accuracy is great; when it isn't, accuracy drops.
This also happens with a learner trained on a TabularDataBunch that doesn't contain a test set: if you set a new .data with a test dataframe containing the dep-var and run predictions, then repeat without the dep-var, you get the same drop.
I think what’s going on is I allowed my dependent variable to be in the list of cat_names…
I just tested this now, making sure the dep-var isn't in cat_names, and… moderate accuracy. Wow, the model was literally learning to peek at the back of the book.
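For anyone hitting the same thing, a quick guard I'd add before building the DataBunch (variable names are hypothetical):

```python
dep_var = 'target'                 # hypothetical dep-var name
cat_names = ['cat_a', 'cat_b']     # hypothetical categorical columns
cont_names = ['cont_a', 'cont_b']  # hypothetical continuous columns

# the dep-var must never appear among the input features,
# otherwise the model just reads the answer off the features
assert dep_var not in cat_names, "label leakage via cat_names"
assert dep_var not in cont_names, "label leakage via cont_names"
```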