Rossman Notebook: "proc_df" uses response variable with test dataframe?

EmptyCasket · May 26, 2018, 9:33pm

Thank you in advance for reading my topic. I’ve adapted the “Rossman Notebook” to the Sberbank dataset from Kaggle (it’s small and has categorical data).

I run “proc_df” on my training data, but I get an error when I run “proc_df” on my testing data. The Rossman notebook shows this:

df_test, _, nas, mapper = proc_df(joined_test, 'Sales', do_scale=True, skip_flds=['Id'],
                                  mapper=mapper, na_dict=nas)

‘Sales’ is clearly the response we are trying to predict. I believe the parameter that takes this value is ‘y_fld’. I have two questions:

1. Why is the response included in the processing of the testing set? As I understand it, we’re not attempting to generate predictions.
2. Is it possible to run proc_df with y_fld=None? When I try this in my code, I receive an error about null values.

My reason for asking is that the training set and the testing set are not the same length, so it wouldn’t make sense to attach the response from the training set to the testing set.

Please share your thoughts on this, thank you!

ThomVett · October 9, 2018, 5:59am

Hi,

I just ran into the same question myself, and my issue was that I had not run apply_cats on the test set.
Before running proc_df on the test set, you should also assign it the same categories as the training set:

train_cats(df_train)
apply_cats(df_test, df_train)  # first argument is the frame to modify, second is the training frame
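To illustrate why the category step matters, here is a minimal sketch in plain pandas of what sharing categories between the two sets amounts to. It is not the fastai implementation, only an approximation of the idea as I understand it; the frames and the `store_type` column are made up for the example.

```python
import pandas as pd

# Toy frames (hypothetical data): "d" appears only in the test set.
df_train = pd.DataFrame({"store_type": ["a", "b", "c", "a"]})
df_test = pd.DataFrame({"store_type": ["c", "a", "d"]})

# Roughly the train_cats idea: make the training column an ordered categorical.
df_train["store_type"] = df_train["store_type"].astype("category").cat.as_ordered()

# Roughly the apply_cats idea: reuse the training categories on the test column,
# so the same label maps to the same integer code in both frames.
train_categories = df_train["store_type"].cat.categories
df_test["store_type"] = pd.Categorical(
    df_test["store_type"], categories=train_categories, ordered=True
)

train_codes = df_train["store_type"].cat.codes.tolist()
test_codes = df_test["store_type"].cat.codes.tolist()
print(train_codes)  # [0, 1, 2, 0]
print(test_codes)   # [2, 0, -1]  (-1 marks a label unseen in training)
```

Without the shared categories, each frame would get its own label-to-code mapping, and a model trained on one would misread the other.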

Regarding your first question: you do not need the y in the testing set, so setting y_fld to None works just as well.
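As a rough illustration of that point, the sketch below uses a hypothetical helper, `split_target`, to show the pattern of making the response optional. It is not proc_df itself, and the toy values are invented; only the column names are borrowed from the Sberbank data.

```python
import pandas as pd

def split_target(df, y_fld=None):
    """Hypothetical helper: return (features, target); target is None when y_fld is None."""
    if y_fld is None:
        return df.copy(), None
    return df.drop(columns=[y_fld]), df[y_fld].to_numpy()

# Toy data: the training frame has the response, the test frame does not.
train = pd.DataFrame({"full_sq": [43, 34], "price_doc": [5850000, 6000000]})
test = pd.DataFrame({"full_sq": [39, 79]})

X_train, y_train = split_target(train, "price_doc")
X_test, y_test = split_target(test)  # no response column needed

print(list(X_train.columns), y_train.tolist())
print(list(X_test.columns), y_test)
```

The two frames can have different lengths because nothing from the training response is ever attached to the test frame.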