Thank you in advance for reading my topic. I’ve adapted the “Rossman Notebook” to the Sberbank dataset from Kaggle (it’s small and has categorical data).
I run “proc_df” on my training data, but I get an error when I run “proc_df” on my testing data. The Rossman notebook shows this:
df_test, _, nas, mapper = proc_df(joined_test, 'Sales', do_scale=True, skip_flds=['Id'],
                                  mapper=mapper, na_dict=nas)
‘Sales’ is clearly the response we are trying to predict. I believe the parameter that takes this value is ‘y_fld’. I have two questions:
- Why is the response included in the processing of the testing set? As I understand it, we’re not attempting to generate predictions.
- Is it possible to run proc_df and set ‘y_fld = None’? I’ve attempted to do this in my code, and I receive an error about null values.
My reason for asking these questions is that the training set and the testing set are not the same length, so it wouldn’t make sense to attach the response in the training set onto the testing set.
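To make the length problem concrete, here is a minimal pandas-only sketch. The frames are toy stand-ins with made-up values, and the column names (`id`, `full_sq`, `price_doc`) are only illustrative. The placeholder-column idea is a guess on my part about a possible workaround, not something I have confirmed is what the notebook or proc_df expects.

```python
import pandas as pd

# Toy stand-ins for the train/test frames (values are made up for illustration).
train = pd.DataFrame({"id": [1, 2, 3],
                      "full_sq": [43.0, 34.0, 50.0],
                      "price_doc": [5.8, 6.0, 5.7]})
test = pd.DataFrame({"id": [4, 5], "full_sq": [39.0, 79.2]})

# The test frame has no response column and a different length from train,
# so the training response cannot simply be attached to it.
assert "price_doc" not in test.columns
assert len(train) != len(test)

# Assumed workaround: add a constant placeholder response with the same
# length as the test frame, so a response column name can still be passed.
test_with_y = test.copy()
test_with_y["price_doc"] = 0

# The placeholder is split off again and ignored; only the features are kept.
y_placeholder = test_with_y.pop("price_doc")
X_test = test_with_y.drop(columns=["id"])
```

If something like this is the intended pattern, the response passed for the test set would just be a dummy that gets thrown away, which would answer my first question.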
Please share your thoughts on this, thank you!