Correct sequence for TabularList add_test and Normalize

gevezex · March 4, 2019, 10:22pm

I am trying to create a databunch with the following method:

data_df = (TabularList.from_df(train_df[cols_for_training], path='.', cont_names=cont_names, procs=[Normalize])
           .split_by_idx(val_idxs)
           .label_from_df(cols=dep_var)
           .add_test(test_df)
           .databunch())

This goes terribly wrong. I suspect that the training data is normalized and test_df is not, so normalizing first and adding the test set afterwards does not work.

When I remove the “procs=[Normalize]” part it goes well. So I probably need to call “.add_test(test_df)” before normalizing.

How do I do that with TabularList.from_df?

Pak · March 27, 2019, 8:50pm

In fact, when you pass procs when building your data (data_df), the system stores how the data was normalized inside the LabelList object (and, as a consequence, inside the learner). Before predicting on the validation set (and, I'm sure, on any other set or item, test included), it applies exactly these transformations (Normalize) to the new data.
The problem may be with this list of procs. I assume that [Normalize] alone can be used safely only if you are sure that your data contains no a) categorical (i.e. string) cells or b) missing cells.
You can try setting procs=[FillMissing, Categorify, Normalize] and passing procs=procs to TabularList.from_df(train_df[cols_for_training], path='.', cont_names=cont_names, procs=procs); maybe this helps.
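To make the "stores how it was normalized and re-applies it" idea concrete, here is a minimal plain-Python sketch. It is not fastai's actual implementation; the helper names fit_normalize and apply_normalize, the use of the population standard deviation, and the eps value are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def fit_normalize(train_rows, cont_names):
    """Compute per-column (mean, std) from the training rows only.

    Hypothetical helper: rows are plain dicts mapping column name to float.
    """
    stats = {}
    for name in cont_names:
        values = [row[name] for row in train_rows]
        stats[name] = (mean(values), pstdev(values))
    return stats

def apply_normalize(rows, stats, eps=1e-7):
    """Normalize rows using previously stored training statistics.

    The same stats are reused for validation and test rows, so those
    sets are never normalized with their own mean and std.
    """
    normalized = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for name, (mu, sigma) in stats.items():
            new_row[name] = (row[name] - mu) / (sigma + eps)
        normalized.append(new_row)
    return normalized
```

In this sketch you would call fit_normalize once on the training rows and then apply_normalize on the training, validation, and test rows alike. A test set that skips the second step keeps its raw scale, which would match the symptom described in the first post.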

gevezex · March 27, 2019, 10:50pm

Thanks for replying, @Pak. In my example I did not have categorical features or NaN values, which is why Normalize was the only proc I needed.

By the way, I solved my issue by building the databunch with TabularDataBunch.from_df(), passing the test set as an argument. That works as designed.
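For anyone landing here later, the call described above might look roughly like the sketch below. It assumes fastai v1, where TabularDataBunch.from_df accepts a test_df keyword; the wrapper function build_databunch is a hypothetical name, and the arguments reuse the variable names from the first post.

```python
def build_databunch(train_df, test_df, dep_var, cont_names, val_idxs, path="."):
    """Build a tabular databunch with the test set passed in directly.

    Hypothetical wrapper, assuming fastai v1 is installed. The import is
    kept inside the function so the sketch can be defined without fastai.
    """
    from fastai.tabular import TabularDataBunch, Normalize

    return TabularDataBunch.from_df(
        path,
        train_df,
        dep_var,
        valid_idx=val_idxs,
        procs=[Normalize],      # only proc needed: no categorical or NaN values
        cont_names=cont_names,
        test_df=test_df,        # test set handed over at construction time
    )
```

Here train_df is expected to contain both the training columns and the dep_var column, since from_df takes the label column name rather than a separate label step.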