Independent variables for training and test set

Hi everyone,

I am working on tabular data (the Bike Sharing Demand competition) on Kaggle. It provides the training set (11 independent variables) and the test set (9 variables) separately, but the test set is missing 2 of the training variables (registered and casual).

When I tried to get predictions on the test set, I got an index error, because (I guess) the training and test sets don’t have the same number of variables. In this case, do I have to leave “registered” and “casual” out of training?

RF shows “registered” is the most important feature so I didn’t want to remove it.

I hope to hear from you guys, and thank you for your time!

I think there are two ways of dealing with this:

  • You remove the two variables from the training set. Since you don’t have these variables available at test time, it’s usually not right to rely on them for training.
  • If you don’t want to remove the variables from your training set, one workaround would be to add these variables as empty columns to your test set. In this case, I believe fastai will fill in the missing values in your test set based on the median values from your training set.
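Here is a rough sketch of both options in plain pandas, in case it helps. The tiny DataFrames and their values are made up for illustration, and the manual median fill only imitates what I described above; it is not fastai's own code.

```python
import pandas as pd

# Toy stand-ins for the Kaggle files; the values are invented for illustration.
train = pd.DataFrame({
    "temp": [9.8, 14.2, 20.1, 25.3],
    "humidity": [81, 75, 60, 52],
    "casual": [3, 8, 20, 35],
    "registered": [13, 32, 90, 140],
    "count": [16, 40, 110, 175],
})
test = pd.DataFrame({
    "temp": [11.0, 22.5],
    "humidity": [70, 58],
})

target = "count"

# Feature columns that exist in train but not in test.
missing_in_test = [
    c for c in train.columns if c != target and c not in test.columns
]

# Option 1: train only on features that are available at prediction time.
shared_features = [c for c in train.columns if c in test.columns]
X_train_opt1 = train[shared_features]
X_test_opt1 = test[shared_features]

# Option 2: keep every feature, add the missing ones to test as NaN columns,
# then fill them with the median of the corresponding training column.
feature_cols = [c for c in train.columns if c != target]
X_train_opt2 = train[feature_cols]
X_test_opt2 = test.reindex(columns=feature_cols)
X_test_opt2 = X_test_opt2.fillna(X_train_opt2.median())

print(missing_in_test)
print(X_test_opt2)
```

Note that with option 2 every test row gets the same filled value for those columns, so they cannot help the model tell test rows apart.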

I would try removing the variables from your training set first, check the accuracy (or whatever metric you care about) on the test set and have a look at feature importance. Then you can try adding the empty columns and see how your test set accuracy changes.

Since it’s a Kaggle competition, you can also have a look at other people’s work to see how they have dealt with this issue.

Thank you! I will try both ways :smiley:

Yes (or if they are categorical, a special #na# tag and an is_missing column, such as your “registered” variable)