Independent variables for training and test set

orangegirlu · June 12, 2020, 7:10pm

Hi everyone,

I am working on tabular data (the Bike Sharing Demand competition) on Kaggle. It provides the training set (11 independent variables) and the test set (9 variables) separately, but the test set doesn’t have 2 of the independent variables (registered and casual).

When I tried to get predictions on the test set, I got an index error, because (I guess) the training and test sets don’t have the same number of variables. In this case, do I have to leave the “registered” and “casual” variables out of training?

The random forest shows “registered” as the most important feature, so I didn’t want to remove it.

I hope to hear from you guys, and thank you for your time!

stefan-ai · June 14, 2020, 10:42am

I think there are two ways of dealing with this:

1. Remove the two variables from the training set. Since these variables are not available at test time, the model usually shouldn’t rely on them for training.
2. If you don’t want to remove them, one workaround is to add them to your test set as empty columns. In this case, I believe fastai will fill in the missing values in your test set with the median values from your training set.
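Both options can be sketched in plain pandas. This is only a rough illustration, not what fastai does internally: the small frames and their values are made up, and the names `extra_cols`, `train_opt1` and `test_opt2` are hypothetical. The median fill for the second option is done by hand here instead of relying on any library preprocessing.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the Kaggle train/test frames (values are made up).
train = pd.DataFrame({
    "temp": [9.8, 14.0, 20.5, 25.1],
    "humidity": [81, 75, 60, 50],
    "casual": [3, 8, 5, 12],
    "registered": [13, 32, 27, 40],
    "count": [16, 40, 32, 52],
})
test = pd.DataFrame({"temp": [10.7, 22.0], "humidity": [56, 64]})

target = "count"
# Columns that exist in train but not in test (excluding the target).
extra_cols = [c for c in train.columns if c not in test.columns and c != target]

# Option 1: drop from the training set the columns the test set lacks.
train_opt1 = train.drop(columns=extra_cols)

# Option 2: add the missing columns to the test set as NaN,
# then fill them with the per-column medians of the training set.
test_opt2 = test.copy()
for col in extra_cols:
    test_opt2[col] = np.nan
test_opt2 = test_opt2.fillna(train[extra_cols].median())
```

Note that with the second option every test row gets the same value for the filled columns, so the model gets no real information from them at prediction time. Also, as far as I know, in this dataset `count` is the sum of `casual` and `registered`, which would explain why “registered” ranks so high in feature importance.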

I would try removing the variables from your training set first, check the accuracy (or whatever metric you care about) on the test set and have a look at feature importance. Then you can try adding the empty columns and see how your test set accuracy changes.

Since it’s a Kaggle competition, you can also have a look at other people’s work to see how they have dealt with this issue.

orangegirlu · June 15, 2020, 1:37pm

Thank you! I will try both ways.

muellerzr · June 15, 2020, 1:42pm

Yes (or if they are categorical, a special #na# tag and an is_missing column, such as your “registered” variable)
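As a rough pandas sketch of that idea (not fastai's own code): a missing categorical value gets a placeholder tag, and a missing numeric value gets a fill value plus a boolean flag column. The column name `registered_is_missing`, the `season` column and the value of `train_median` are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy test frame where "registered" is entirely missing and
# one categorical value is missing (values are made up).
test = pd.DataFrame({
    "temp": [10.7, 22.0],
    "registered": [np.nan, np.nan],
    "season": [None, "spring"],
})

train_median = 29.5  # hypothetical median of "registered" in the training set

# Numeric column: record which rows were missing, then fill with the median.
test["registered_is_missing"] = test["registered"].isna()
test["registered"] = test["registered"].fillna(train_median)

# Categorical column: replace missing entries with a placeholder category.
test["season"] = test["season"].fillna("#na#")
```

The flag column lets the model tell a filled-in value apart from a real one.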