Problem with test set in TabularPandas

murodbek · December 4, 2022, 9:23am

Hello there. While working on the Tabular Playground Series - May 2022 competition on Kaggle, I noticed something. I created a TabularPandas object for the training dataset like this:

to = TabularPandas(df_train, procs=[Categorify, Normalize], cat_names=cat,
                   cont_names=cont, y_names=dep_var, y_block=CategoryBlock(),
                   splits=RandomSplitter()(range_of(df_train)))

After training a decision tree model, I wanted to make predictions on the test set, so I created a TabularPandas object for the test dataset like this:

to_test = TabularPandas(df_test, procs=[Categorify, Normalize], cat_names=cat,
                        cont_names=cont)
tst_xs = to_test.train.xs

As you can see, these are two separate TabularPandas objects, so the test set is normalized with its own means and standard deviations rather than those of the training set. This causes a large drop in the model's performance.

For more detail, I have made my notebook public: TabularPS May 2022 | Kaggle

P.S. I should also mention that I am not an expert practitioner and I am new here, so any resources or links for solving this problem are appreciated.

P.S. 2: Sorry for the title

muellerzr · December 4, 2022, 1:35pm

See the code here for how to preprocess a new set (ignore the export part): Exporting `TabularPandas` for Inference (Intermediate) | walkwithfastai
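
The underlying idea is that the test set has to go through the preprocessing that was fitted on the training set, not a fresh fit. In fastai that roughly means building the test object from the existing one (the linked tutorial does this with `to.train.new(df_test)` followed by `process()`, as far as I recall) instead of constructing a second TabularPandas. Below is a minimal plain-Python sketch of why that matters. It is not fastai code: the helper names and the toy data are made up for illustration, and the normalizer and category encoder are simplified stand-ins for Normalize and Categorify.

```python
from statistics import mean, pstdev


def fit_normalizer(values):
    """Compute (mean, std) from the given values (hypothetical helper)."""
    return mean(values), pstdev(values)


def apply_normalizer(values, stats):
    """Standardize values using previously fitted statistics."""
    mu, sigma = stats
    return [(v - mu) / sigma for v in values]


def fit_categories(values):
    """Map each category to an integer code; 0 is kept for unseen values."""
    return {cat: i + 1 for i, cat in enumerate(sorted(set(values)))}


def apply_categories(values, mapping):
    """Encode values with a fitted mapping, sending unknown ones to 0."""
    return [mapping.get(v, 0) for v in values]


# Toy data, invented for this example.
train_cont = [10.0, 20.0, 30.0]
test_cont = [20.0, 40.0]
train_cat = ["a", "b", "c"]
test_cat = ["c", "b", "z"]

# Fit once, on the training data only.
train_stats = fit_normalizer(train_cont)
train_codes = fit_categories(train_cat)

# Reusing the training fit: test values land on the scale the model saw.
test_cont_ok = apply_normalizer(test_cont, train_stats)
test_cat_ok = apply_categories(test_cat, train_codes)

# Refitting on the test data: same raw values, different encoded values.
test_cont_refit = apply_normalizer(test_cont, fit_normalizer(test_cont))
test_cat_refit = apply_categories(test_cat, fit_categories(test_cat))

print("reused fit:", test_cont_ok, test_cat_ok)
print("refit:     ", test_cont_refit, test_cat_refit)
```

With the refit, the raw value 20.0 no longer maps to the same normalized number as in training, and the category "b" gets a different integer code, so the model receives inputs it was never trained to interpret.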

murodbek · December 4, 2022, 2:17pm

Thank you, it worked