Encoding known ordinal data as unknown

ArchieIndian · August 20, 2018, 10:59am

In the structured data modelling lesson, it is suggested to treat something like a year as a categorical variable. In that case, the ordered, increasing nature of the year is lost. Say the model was built on data from 2015 to 2017 and we want to deploy it on 2018 or 2019. Since year is a categorical variable, the trend in the time series data will be lost. Any thoughts on this?
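To make the concern concrete, here is a minimal sketch in plain Python. It assumes the category mapping is built only from the training years and that unseen values fall back to a reserved "unknown" code; the names `year_to_code`, `encode_year` and `UNKNOWN` are hypothetical and not taken from the lesson's library.

```python
# Years present in the training data (from the example above).
train_years = [2015, 2016, 2017]

# Assumption: code 0 is reserved for values never seen during training.
UNKNOWN = 0
year_to_code = {year: i + 1 for i, year in enumerate(sorted(set(train_years)))}

def encode_year(year):
    """Map a year to its categorical code, or to UNKNOWN if unseen."""
    return year_to_code.get(year, UNKNOWN)

codes = [encode_year(y) for y in [2015, 2016, 2017, 2018, 2019]]
print(codes)  # [1, 2, 3, 0, 0]
```

Under this assumption, 2018 and 2019 both collapse to the same code, so the model cannot tell them apart or place them after 2017.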

Gabriel_Syme · August 21, 2018, 7:59am

I’m not so sure this is exactly the case.

I think you need to treat this as an embedding rather than as typical categorical data handling (e.g. one-hot encoding). I would think that by creating a trainable embedding for a categorical variable, one can retain (maybe even enrich?) the information on how each value (category) relates to the others and to the target.
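As a rough illustration of that idea, here is a toy sketch in plain Python rather than the course library. It assumes a 1-dimensional embedding (one trainable number per year), made-up target values with an upward trend, and a model whose prediction is just the embedding value itself. A real model would use a wider embedding feeding into further layers.

```python
# Toy data: (year, target) pairs with an upward trend (made-up numbers).
data = [
    (2015, 1.0), (2015, 1.2),
    (2016, 2.0), (2016, 2.2),
    (2017, 3.0), (2017, 3.2),
]

years = sorted({year for year, _ in data})
year_to_idx = {year: i for i, year in enumerate(years)}

# Embedding table: one trainable scalar per category, all starting at zero,
# so the table initially carries no ordering information.
embedding = [0.0 for _ in years]

learning_rate = 0.1
for epoch in range(500):
    for year, target in data:
        idx = year_to_idx[year]
        error = embedding[idx] - target               # prediction minus target
        embedding[idx] -= learning_rate * 2 * error   # gradient step on squared error

for year in years:
    print(year, round(embedding[year_to_idx[year]], 2))
```

Each learned value ends up close to the mean target of its year, so the ordering 2015 < 2016 < 2017 is recovered from the data even though the years were fed in as unordered categories. Note that this only covers years seen in training; it does not by itself give a value for 2018.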

The Rossmann paper also mentions that the learned embedding can sometimes provide good visual information about the category (by plotting it in 2 dimensions), but that’s a fairly typical use of trained embeddings in general.

Kind regards,
Theodore.