I built and trained a model on top of the tabular learner to predict numerical values, and it's working well with a pretty good degree of accuracy. However, my input data are (float) numbers like so:
id   | date       | value
3691 | 2019-04-18 | 1245.00
3692 | 2019-04-22 | 1236.67
Because I use normalization as a pre-processor, the resulting target / prediction values come out normalized rather than on the original scale.
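If the pre-processor standardized the dependent variable, predictions can be mapped back to the original scale with the stored statistics. A minimal numpy sketch; the mean/std values here are made-up placeholders for whatever your own normalizer recorded for the target column:

```python
import numpy as np

# Hypothetical statistics captured by the normalization pre-processor;
# substitute the mean/std your own pipeline stored for the target column.
target_mean, target_std = 1240.5, 4.2

def denormalize(preds, mean, std):
    """Map standardized predictions back to the original scale."""
    return np.asarray(preds, dtype=float) * std + mean

normalized_preds = np.array([0.5, -0.8])
raw_preds = denormalize(normalized_preds, target_mean, target_std)
```

This is just the inverse of the standardization step (`x * std + mean`); the key is to use exactly the statistics the training pipeline computed, not statistics recomputed on new data.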
I have another question: my model seems to overfit to some degree. The test data do just fine, but when validating on new data the model has never seen before, the predictions tend to be much further off.
Any idea how to tweak the model to generalize better?
If I were facing such a situation, my first thought would be that the validation/test data differ somehow from the new 'real-world' data. So I would look at these datasets and analyze the differences (and then try to make the validation/test data and the new data more similar).
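One quick way to analyze such differences is to compare per-column summary statistics between the split you trained on and the new data. A rough sketch with made-up numbers (the drift score here is just the shift in means measured in training standard deviations, not any standard library function):

```python
import numpy as np

def column_shift(train_col, new_col):
    """Difference of means in units of the training std: a rough drift score."""
    train_col = np.asarray(train_col, dtype=float)
    new_col = np.asarray(new_col, dtype=float)
    std = train_col.std() or 1.0  # guard against zero-variance columns
    return abs(new_col.mean() - train_col.mean()) / std

# Made-up example: new data sits well outside the training range.
train_values = np.array([1245.0, 1236.67, 1240.3, 1250.1])
new_values = np.array([1310.2, 1305.5, 1299.9])
drift = column_shift(train_values, new_values)
```

A score near 0 means the new data looks like the training data for that column; a score of several standard deviations is a strong hint that the model is being asked to extrapolate.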
Or maybe I've tested too much, i.e. I've 'meta-overfitted' to my validation data, and need to set aside a separate test dataset for additional validation.
The other thought would be to play with regularization (maybe add some more).
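The effect of regularization is easiest to see in a linear setting: closed-form ridge regression, where the L2 penalty shrinks the weights, the same idea as weight decay in a tabular neural net. A self-contained sketch on synthetic data (not the poster's dataset):

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic regression problem with known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

w_plain = ridge_weights(X, y, lam=0.0)   # ordinary least squares
w_ridge = ridge_weights(X, y, lam=10.0)  # penalized: smaller weight norm
```

Larger `lam` pulls the weights toward zero, trading a bit of training fit for stability on unseen data; in a fastai-style tabular model the analogous knobs are weight decay and the dropout probabilities.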
But I don't have much experience here; I've only tried a couple of cases with real-world data and tabular models.
The test / validation set is numerically further away from the train set, so yes, there is a difference in the data. The elephant in the room, however, is the fairly small dataset, so I guess I need a bit more than just regularization and should possibly look into data augmentation and cross-validation.
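For the cross-validation idea, a minimal k-fold split in plain numpy (not tied to any particular library) shows how a small dataset can still use every row for both training and validation:

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Shuffle indices once, then yield (train_idx, valid_idx) for each of k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, valid_idx
```

Each sample lands in exactly one validation fold, so with k models you get an out-of-sample prediction for every row; averaging the k validation scores also gives a much less noisy estimate of generalization than a single small holdout.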
That said, I'm getting the sense that I should phase out the tabular learner soon.