Features like "year" in tabular data and similar

BjoernB · October 11, 2019, 8:30am

Assume you have training and validation data from historical events and want to predict future ones (not as a time series, but more like in the Rossmann challenge). Using the year as one of the features may turn out to be very useful.
As far as I can tell, it might be a good idea to treat it as a categorical feature (and thus learn year embeddings).

Of course, this does not work when you encounter unseen years in production (or in your test set). First of all, there would not even be an embedding for such a year, and even if there were, it would be untrained: useless at best, and probably harmful.
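To make the problem concrete, here is a minimal sketch of what I mean. The years and the names `year_to_index` and `embedding_index` are made up for illustration and are not taken from any particular library; I only assume that the embedding rows are indexed by a vocabulary built from the training data.

```python
# Years that occur in the (hypothetical) training data.
train_years = [2013, 2014, 2015, 2013, 2014]

# Build the vocabulary from training data only, the way an embedding
# layer would get one row per known category.
year_to_index = {year: i for i, year in enumerate(sorted(set(train_years)))}


def embedding_index(year, unknown_index=None):
    """Return the embedding row for `year`, or `unknown_index` if it was never seen."""
    return year_to_index.get(year, unknown_index)


seen = embedding_index(2014)    # a year the model was trained on
unseen = embedding_index(2016)  # a future year: no row exists for it
```

Even if I reserved an extra "unknown" row for such years, nothing in training would ever have updated it in a meaningful way.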

Is it a good idea to use the year as a continuous variable instead? Sure, I could create a validation set that contains only the newest year and exclude that year from training. However, even if a continuous year feature works well on that validation set, this does not necessarily mean it will hold for later years, does it?
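The kind of split I have in mind could be sketched like this. The rows and the helper `split_by_newest_year` are hypothetical and only meant to show the idea of holding out the most recent year.

```python
# Toy rows; the "sales" values are invented for illustration.
rows = [
    {"year": 2013, "sales": 10.0},
    {"year": 2014, "sales": 12.0},
    {"year": 2014, "sales": 11.0},
    {"year": 2015, "sales": 13.0},
    {"year": 2015, "sales": 14.0},
]


def split_by_newest_year(rows, year_key="year"):
    """Put the newest year into the validation set and everything older into training."""
    newest = max(row[year_key] for row in rows)
    train = [row for row in rows if row[year_key] < newest]
    valid = [row for row in rows if row[year_key] == newest]
    return train, valid


train_rows, valid_rows = split_by_newest_year(rows)
```

This mimics the production situation for exactly one step into the future, which is the part that worries me.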

After all, the model might have learned to predict global economic growth but falsely assume a linear trend when there are actually cycles. In such a case, my validation set may make it look as if the model got something useful out of the continuous year feature, while it is actually harmful on a test set.
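Here is a toy illustration of that worry. Everything in it is an assumption for the sake of the example: the target is a pure 20-year cycle with no trend, the "model" is a plain least-squares line on the year, and the baseline ignores the year and predicts the training mean.

```python
import math


def true_value(year):
    # Hypothetical ground truth: a 20-year cycle around 100, no trend at all.
    return 100.0 + 10.0 * math.sin(2.0 * math.pi * (year - 2010) / 20.0)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


train_years = [2010, 2011, 2012, 2013, 2014]
train_targets = [true_value(year) for year in train_years]

slope, intercept = fit_line(train_years, train_targets)
train_mean = sum(train_targets) / len(train_targets)  # baseline that ignores the year


def errors(year):
    """Absolute error of (linear year model, year-free baseline) for one year."""
    linear_error = abs(slope * year + intercept - true_value(year))
    baseline_error = abs(train_mean - true_value(year))
    return linear_error, baseline_error


valid_linear, valid_baseline = errors(2015)  # newest year, held out for validation
test_linear, test_baseline = errors(2020)    # a later, truly unseen year
```

In this setup the linear year model beats the baseline on the held-out year 2015, because the cycle is still rising there, but it is clearly worse than the baseline on 2020, once the cycle has turned.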

Would you recommend abandoning the year altogether instead?