In lesson #1 of the machine learning course we use proc_df to process the bulldozers dataset. proc_df takes the dataframe and returns:
1 - the df, plus an extra column ending in _na for each column that contains na values
2 - the y column, to use as the labels
3 - a dictionary of na values (the columns that had nas and the value used to fill them)
I am using the same structure on the House Prices Kaggle competition. The problem is that the columns containing nas in the training set are different from those in the validation set.
When I run proc_df on the validation set I get more columns with nas, and therefore more _na columns, so the column counts no longer match and the model won't fit.
I sorted it by manually removing the additional columns from the validation set, but this doesn't seem like a great solution.
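For reference, the manual workaround can be sketched roughly like this. The two frames are toy stand-ins for what proc_df returned, and the column names are only illustrative, not taken from the real output:

```python
import pandas as pd

# Toy stand-ins for the processed frames (illustrative column names only).
train_df = pd.DataFrame({
    "LotFrontage": [65.0, 72.5],
    "GarageArea": [480.0, 300.0],
    "LotFrontage_na": [False, True],
})
valid_df = pd.DataFrame({
    "LotFrontage": [70.0, 60.0],
    "GarageArea": [500.0, 500.0],
    "LotFrontage_na": [False, False],
    "GarageArea_na": [True, False],  # only the validation set had nas here
})

# Find the columns that exist in the validation frame but not in training.
extra_cols = [c for c in valid_df.columns if c not in train_df.columns]
print(extra_cols)

# Drop them so both frames have the same columns.
valid_df = valid_df.drop(columns=extra_cols)
```

This gets the shapes to match, but it throws away the information about which validation rows were missing, which is part of why it doesn't feel like a great solution.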
I worked it out for anyone who is interested.
You pass nas (the dictionary returned by proc_df) as na_dict when processing both the training and validation sets. This updates the _na columns where they differ, ensuring that both dataframes end up with the same number of columns.
train_df, y, nas = proc_df(df_raw, 'SalePrice', na_dict=nas)
test_df, _, _ = proc_df(df_test, na_dict=nas)
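A rough sketch of why sharing the dictionary evens out the columns. fill_missing_sketch is a hypothetical, simplified stand-in written with plain pandas, not the real proc_df, and the toy data and the order of the passes are assumptions made for illustration:

```python
import pandas as pd

def fill_missing_sketch(df, na_dict=None):
    """Hypothetical stand-in for the missing-value handling in proc_df."""
    df = df.copy()
    na_dict = {} if na_dict is None else na_dict
    # Snapshot the column list so newly added _na columns are not revisited.
    for name in list(df.columns):
        col = df[name]
        if not pd.api.types.is_numeric_dtype(col):
            continue
        # Add an _na flag if the column has nas OR is already in the dictionary.
        if col.isnull().any() or name in na_dict:
            filler = na_dict[name] if name in na_dict else col.median()
            df[name + "_na"] = col.isnull()
            df[name] = col.fillna(filler)
            na_dict[name] = filler
    return df, na_dict

# Toy data: each set has nas in a different column.
train_raw = pd.DataFrame({"LotFrontage": [65.0, None, 80.0],
                          "GarageArea": [480.0, 0.0, 300.0]})
test_raw = pd.DataFrame({"LotFrontage": [70.0, 60.0],
                         "GarageArea": [None, 500.0]})

# First pass over the training data builds the dictionary.
_, nas = fill_missing_sketch(train_raw)
# The test pass picks up columns that only have nas in the test data.
test_df, nas = fill_missing_sketch(test_raw, na_dict=nas)
# Re-running the training data with the updated dictionary adds the same _na columns.
train_df, nas = fill_missing_sketch(train_raw, na_dict=nas)

print(list(train_df.columns) == list(test_df.columns))
```

In this sketch the dictionary only covers both sets after it has seen both, so the training data is processed again at the end.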
Hope this makes sense if anyone else was stuck here.