TabularPandas: Use normalization on a separate dataset

mikey · August 10, 2022, 7:47pm

Hi,
I trained a tabular pandas model, and as part of this training, I normalized the dataset following the example in the book. I want to deploy my model on a separate dataset (“test_df”), but first I need to normalize the data in test_df using the same normalization routine (e.g., min max scaler using the same min and max) as was used with the training dataset. I assume there is a way to extract the normalization procedure from the TabularPandas, but I haven’t been able to figure it out. Here is a basic example:

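from fastai.tabular.all import *

# df and cont_nn (the list of continuous column names) are defined earlier
procs_nn = [Categorify, Normalize]
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to_nn = TabularPandas(df, procs_nn,
    cat_names=None, cont_names=cont_nn, y_names='y',
    splits=splits)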

Ideally, there would be some way for me to normalize my test_df the same way these training/validation data were normalized. E.g.,

to_nn.normalize(test_df)

Except when I do this, the output is identical to test_df. The column names in df and test_df are identical.

I am using fastai v. 2.7.7
Note I also posted this in the 2019 forum yesterday. Sorry for crossposting, but I’m new here and learning my way around.

muellerzr · August 10, 2022, 9:50pm

Check out my Walk with fastai guide on this topic: Exporting `TabularPandas` for Inference (Intermediate)

(Towards the bottom it shows how to process new data)

mikey · August 11, 2022, 1:27pm

Thank you for sharing your tutorial, @muellerzr! The simple answer is:

# make a normalized test_df
to_test = to_nn.new(test_df)
to_test.process()

# make a dataloader for this testing dataset
test_dl = to_test.dataloaders(1)

# then I can use my learner to get predictions
preds,_ = learn.get_preds(dl=test_dl)

Essentially, to_nn.normalize(test_df) does not work as you might expect. Instead, call to_nn.new(test_df) and then run .process() on the resulting object, which applies the procs already fitted on the training data (including Normalize) to test_df.
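
For anyone curious about what "using the same normalization" means in practice, here is a rough, library-free sketch of the idea: compute the statistics on the training rows only, then reuse them unchanged on the new rows. As far as I can tell, fastai's Normalize standardizes with the training mean and standard deviation rather than min and max, so the sketch does the same. The names fit_stats and apply_stats are hypothetical helpers made up for illustration, not part of fastai.

```python
from statistics import fmean, pstdev

def fit_stats(rows, cols):
    """Compute per-column (mean, std) from the training rows only."""
    stats = {}
    for c in cols:
        vals = [r[c] for r in rows]
        # small epsilon (an assumption here) guards against division by zero
        stats[c] = (fmean(vals), pstdev(vals) + 1e-7)
    return stats

def apply_stats(rows, stats):
    """Standardize rows with previously fitted stats; nothing is refit."""
    out = []
    for r in rows:
        new = dict(r)
        for c, (mean, std) in stats.items():
            new[c] = (r[c] - mean) / std
        out.append(new)
    return out

# toy data standing in for df and test_df
train = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
test = [{"x": 2.0}, {"x": 5.0}]

stats = fit_stats(train, ["x"])       # fitted once, on training data
test_norm = apply_stats(test, stats)  # reused as-is on the new data
```

The point is that the test rows never influence the mean or std. In fastai itself there is also a test_dl method on the DataLoaders (learn.dls.test_dl(test_df)) that, as I understand it, wraps the new/process steps for you; worth checking against your fastai version before relying on it.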