Ooo, I had the same problem last night. My dependent variable was a column of floats even though it was binary. Adding the dep-var to cat_names or leaving it continuous didn't change anything: the DataBunch's task type became Regression and data.c was None.
Changing the dtype of the dataframe's dep-var column to np.int64 got it treated as classification, with data.c equal to 2.
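In case it helps anyone, here's a minimal sketch of that fix (the dataframe and column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')   # hypothetical dataframe
dep_var = 'target'             # hypothetical binary dep-var stored as floats (0.0 / 1.0)

# cast the float column to int64 so fastai infers a classification task
df[dep_var] = df[dep_var].astype(np.int64)
```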
I have an aside question on tabular: has anyone seen test-set accuracy drop off a cliff when removing the dependent variable from a DataBunch’s test set? Like 99% → 69%.
That is to say: passing a test dataframe that includes the labels column vs. holding that column out as an array.
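Concretely, the two setups I mean (a sketch; df_test and dep_var are made-up names):

```python
# setup A: test dataframe still contains the labels column
test_a = df_test

# setup B: labels held out as a separate array
y_test = df_test[dep_var].values
test_b = df_test.drop(columns=[dep_var])
```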
update:
Kind-of a duhh moment: it looks like the indices of the 0/1 classes are just encoded in (I'm guessing) descending alphabetical or numeric order.
In other words:
learn.data.train_ds.class2idx gives {1: 0, 0: 1}, so a ‘1’ becomes index 0 and a ‘0’ becomes index 1.
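One practical consequence (a sketch, assuming a fastai v1 learner with a test set already attached): the prediction columns from get_preds follow that mapping, so indices may need translating back to labels.

```python
from fastai.basic_data import DatasetType

# probabilities come back in class-index order, i.e. column 0 is
# the probability of class '1' here, given the {1: 0, 0: 1} mapping
preds, _ = learn.get_preds(ds_type=DatasetType.Test)
pred_idx = preds.argmax(dim=1)

# map indices back to the original labels via the learned mapping
idx2class = {v: k for k, v in learn.data.train_ds.class2idx.items()}
pred_labels = [idx2class[int(i)] for i in pred_idx]
```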
update 2:
Never mind. Despite this being the case, if the dependent variable's column is present in the test set, accuracy is great; when it isn't, accuracy drops.
This also happens with a learner trained on a TabularDataBunch that doesn't contain a test set: if you set a new .data with a test dataframe containing the dep-var and run predictions, then repeat without the dep-var, you get the same drop.
I think what’s going on is I allowed my dependent variable to be in the list of cat_names…
I just tested this now, making sure the dep-var isn't in cat_names, and… moderate accuracy. Wow, the model was literally learning to peek at the back of the book.
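For anyone hitting the same thing, a quick guard I'd add before building the DataBunch (variable names are hypothetical):

```python
dep_var = 'target'                 # hypothetical dep-var name
cat_names = ['cat_a', 'cat_b']     # hypothetical categorical columns
cont_names = ['cont_a', 'cont_b']  # hypothetical continuous columns

# the dep-var must never appear among the input features,
# otherwise the model just reads the answer off the features
assert dep_var not in cat_names, "label leakage via cat_names"
assert dep_var not in cont_names, "label leakage via cont_names"
```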