Predictions with tabular / mismatched features

bento · June 25, 2020, 10:14pm

First, I want to say what a wonderful resource the fastai package, videos & course are. Thank you!

I am trying to apply the lessons to test what I have learned. I thought a current Kaggle tabular dataset would be a good place to start.

The “predict future sales” task is basically a variant of the Rossmann example in the course.

In the Rossmann example, the test dataset has one column fewer than the train dataset. In “predict future sales”, however, the test dataset has only two columns / features: shop_id and item_id.

The gap in my knowledge is: how do I predict with a model trained on many features when the test / holdout dataset has fewer features?

At the moment I get the error: “None of [Index(['item_price'], dtype='object')] are in the [columns]”

Here is the head() of each data frame:

Test:
|shop_id|item_id|

Here is my notebook GitHub link

The error implies I should add all the columns contained in the training dataset to the test dataset, but then what should the values be, NaN / 0 at each entry?

Cheers,
Ben

muellerzr · June 25, 2020, 10:19pm

The simple answer is you can’t. Your model expects the same (n) inputs at inference that it saw during training, and simply filling them in as NA won’t work, as it’ll give you poor results. I’d look into other feature engineering ideas to try to help.
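One possible feature engineering direction, as a rough sketch and not a tested recipe for this competition: summarise the training history per (shop_id, item_id) pair and merge that summary onto the test set, so a column like item_price can be looked up from the two keys the test set does have. The toy values below are made up for illustration, and the choice of a mean plus the per-item / global fallback for unseen pairs is an assumption, not something from the thread or the fastai docs.

```python
import pandas as pd

# Toy stand-ins for the real files; the numbers are invented for illustration.
train = pd.DataFrame({
    "shop_id":    [0, 0, 1, 1, 1],
    "item_id":    [10, 10, 10, 11, 11],
    "item_price": [100.0, 120.0, 90.0, 50.0, 70.0],
})
test = pd.DataFrame({"shop_id": [0, 1, 2], "item_id": [10, 11, 10]})

# Which of the model's input columns are absent from the test set?
feature_cols = ["shop_id", "item_id", "item_price"]
missing = [c for c in feature_cols if c not in test.columns]
print("missing from test:", missing)

# Summarise the history per (shop_id, item_id) pair.
# Assumption: the mean price is a reasonable summary; other aggregates may suit better.
price_by_pair = train.groupby(["shop_id", "item_id"], as_index=False)["item_price"].mean()

# A left merge keeps every test row; pairs never seen in train come back as NaN.
test_fe = test.merge(price_by_pair, on=["shop_id", "item_id"], how="left")

# Assumed fallback for unseen pairs: per-item mean first, then the global mean.
item_mean = train.groupby("item_id")["item_price"].mean()
test_fe["item_price"] = (
    test_fe["item_price"]
    .fillna(test_fe["item_id"].map(item_mean))
    .fillna(train["item_price"].mean())
)
print(test_fe)
```

For this to help, the model would presumably need to be trained on the same kind of aggregated column, so that training and inference inputs are built the same way.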