I built and trained a model on top of the tabular learner to predict numerical values, and it's working well with a pretty good degree of accuracy. However, my input data are (float) numbers like so:
id   | date       | value
3691 | 2019-04-18 | 1245.00
3692 | 2019-04-22 | 1236.67
Because I use normalization as a pre-processor, the resulting target / prediction values come out normalized rather than on the original scale.
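If the pre-processor standardized the dependent variable, predictions can be mapped back to the original scale with the stored statistics. A minimal numpy sketch; the mean/std values here are made-up placeholders for whatever your own normalizer recorded for the target column:

```python
import numpy as np

# Hypothetical statistics captured by the normalization pre-processor;
# substitute the mean/std your own pipeline stored for the target column.
target_mean, target_std = 1240.5, 4.2

def denormalize(preds, mean, std):
    """Map standardized predictions back to the original scale."""
    return np.asarray(preds, dtype=float) * std + mean

normalized_preds = np.array([0.5, -0.8])
raw_preds = denormalize(normalized_preds, target_mean, target_std)
```

This is just the inverse of the standardization step (`x * std + mean`); the key is to use exactly the statistics the training pipeline computed, not statistics recomputed on new data.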
I have another question: my model seems to overfit to some degree. The test data do just fine, but when validating on new data the model has never seen before, the predictions tend to be much further off.
Any idea how to tweak the model to generalize better?
If I were facing such a situation, my first thought would be that the validation/test data differ somehow from the new 'real-world' data. So I would look at these datasets and analyze the differences (and then try to make the validation/test data and the new data more similar).
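One quick way to analyze such differences is to compare per-column summary statistics between the split you trained on and the new data. A rough sketch with made-up numbers (the drift score here is just the shift in means measured in training standard deviations, not any standard library function):

```python
import numpy as np

def column_shift(train_col, new_col):
    """Difference of means in units of the training std: a rough drift score."""
    train_col = np.asarray(train_col, dtype=float)
    new_col = np.asarray(new_col, dtype=float)
    std = train_col.std() or 1.0  # guard against zero-variance columns
    return abs(new_col.mean() - train_col.mean()) / std

# Made-up example: new data sits well outside the training range.
train_values = np.array([1245.0, 1236.67, 1240.3, 1250.1])
new_values = np.array([1310.2, 1305.5, 1299.9])
drift = column_shift(train_values, new_values)
```

A score near 0 means the new data looks like the training data for that column; a score of several standard deviations is a strong hint that the model is being asked to extrapolate.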
Or maybe I've tested too much, i.e. I've 'meta-overfitted' to my validation data, and need to set aside a separate test dataset for additional validation.
The other thought would be to play with regularization (maybe add some more).
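The effect of regularization is easiest to see in a linear setting: closed-form ridge regression, where the L2 penalty shrinks the weights, the same idea as weight decay in a tabular neural net. A self-contained sketch on synthetic data (not the poster's dataset):

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic regression problem with known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

w_plain = ridge_weights(X, y, lam=0.0)   # ordinary least squares
w_ridge = ridge_weights(X, y, lam=10.0)  # penalized: smaller weight norm
```

Larger `lam` pulls the weights toward zero, trading a bit of training fit for stability on unseen data; in a fastai-style tabular model the analogous knobs are weight decay and the dropout probabilities.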
But I don't have much experience here; I've only tried a couple of cases with real-world data and tabular models.
The test / validation set is numerically further away from the train set, so yes, there is a difference in the data. The elephant in the room, however, is the fairly small dataset, so I guess I need a bit more than just regularization and should possibly look into data augmentation and cross-validation.
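For the cross-validation idea, a minimal k-fold split in plain numpy (not tied to any particular library) shows how a small dataset can still use every row for both training and validation:

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Shuffle indices once, then yield (train_idx, valid_idx) for each of k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, valid_idx
```

Each sample lands in exactly one validation fold, so with k models you get an out-of-sample prediction for every row; averaging the k validation scores also gives a much less noisy estimate of generalization than a single small holdout.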
That said, I'm getting the sense that I should phase out the tabular learner soon.