Proc_df() for machine learning course

Hey guys,
In lesson #1 of the machine learning course we use proc_df to process the bulldozers dataset. proc_df takes the dataframe and returns:
1 - the df, plus an extra column ending in _na for each column that contains NA values
2 - a y column as a label set
3 - a dictionary of NA values
I am using the same structure on the house prices Kaggle competition. The problem is that the columns holding NAs in the training set are different from those in the validation set.
When I run proc_df on the validation set I get more columns with NAs, and therefore more _na columns, so the model won't fit.
I sorted it by manually removing the additional columns in the validation set, but this doesn’t seem like a great solution.
Any ideas?
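
To make the mismatch concrete, here is a minimal sketch in plain pandas. The helper add_na_flags is a hypothetical stand-in for the missing-value step described above (it is not the fastai code), and the column names and values are made up.

```python
import pandas as pd

def add_na_flags(df):
    """Hypothetical stand-in for proc_df's missing-value step (not the fastai code).

    For every numeric column containing NaN, add a boolean <col>_na column
    and fill the gaps with that column's median.
    """
    df = df.copy()
    na_dict = {}
    for name in list(df.columns):  # snapshot, since columns are added in the loop
        col = df[name]
        if pd.api.types.is_numeric_dtype(col) and col.isnull().any():
            df[name + '_na'] = col.isnull()
            na_dict[name] = col.median()
            df[name] = col.fillna(na_dict[name])
    return df, na_dict

# Made-up data: the training set has gaps in LotFrontage only,
# the validation set has gaps in LotFrontage and GarageArea.
train = pd.DataFrame({'LotFrontage': [60.0, None, 80.0],
                      'GarageArea': [400.0, 500.0, 600.0]})
valid = pd.DataFrame({'LotFrontage': [70.0, None],
                      'GarageArea': [None, 450.0]})

train_p, train_nas = add_na_flags(train)
valid_p, valid_nas = add_na_flags(valid)

print(list(train_p.columns))  # no GarageArea_na here
print(list(valid_p.columns))  # GarageArea_na appears only here
```

A model fitted on train_p sees three columns, while valid_p has four, which is the situation described above.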

I worked it out, for anyone who is interested.

You pass nas as na_dict when processing both the training and the validation set. This updates the columns where they differ, so that both dataframes end up with the same number of columns.

so:
# nas must already exist here, e.g. from a first pass: _, _, nas = proc_df(df_raw, 'SalePrice')
train_df, y, nas = proc_df(df_raw, 'SalePrice', na_dict=nas)
test_df, _, _ = proc_df(df_test, na_dict=nas)
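
Here is a rough sketch of the idea in plain pandas, for anyone who wants to try it without the library. The helper add_na_flags is a hypothetical stand-in, not the real proc_df, and the data is made up; I am assuming that a column listed in na_dict always gets its _na flag, whether or not the current dataframe has gaps in it. The real function may behave differently depending on the version.

```python
import pandas as pd

def add_na_flags(df, na_dict=None):
    """Hypothetical stand-in for proc_df's missing-value step (not the fastai code).

    A numeric column gets a <col>_na flag if it contains NaN or is already
    listed in na_dict. Gaps are filled with the stored value if there is one,
    otherwise with the column's median.
    """
    df = df.copy()
    na_dict = dict(na_dict or {})  # work on a copy, leave the caller's dict alone
    for name in list(df.columns):
        col = df[name]
        if not pd.api.types.is_numeric_dtype(col):
            continue
        if col.isnull().any() or name in na_dict:
            filler = na_dict.get(name, col.median())
            df[name + '_na'] = col.isnull()
            df[name] = col.fillna(filler)
            na_dict[name] = filler
    return df, na_dict

# Made-up data: GarageArea is missing only in the validation set.
train = pd.DataFrame({'LotFrontage': [60.0, None, 80.0],
                      'GarageArea': [400.0, 500.0, 600.0]})
valid = pd.DataFrame({'LotFrontage': [70.0, None],
                      'GarageArea': [None, 450.0]})

_, nas = add_na_flags(train)           # NA columns seen in training
_, nas = add_na_flags(valid, nas)      # plus any extra ones seen in validation
train_p, _ = add_na_flags(train, nas)  # reprocess both with the combined dict
valid_p, _ = add_na_flags(valid, nas)

print(list(train_p.columns) == list(valid_p.columns))
```

One thing to be aware of in this sketch: the fill value for a column that is only missing in the validation set comes from the validation data itself.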

Hope this makes sense if anyone else was stuck here.

Cheers :wink:

Hi @jeremy,

I think there is a bug in the function proc_df(). If there is no missing data in the training set, then after processing the training set with proc_df() we get nas = {}. When we then pass that nas to proc_df() on a test set with at least one missing value, the _na columns created for the test set are not dropped as they should be.
Because of that, the model cannot run on the test set.

The error comes from the following if statement in the definition of proc_df() (file structured.py):

if len(na_dict_initial.keys()) > 0:
    df.drop([a + '_na' for a in list(set(na_dict.keys()) - set(na_dict_initial.keys()))], axis=1, inplace=True)

I think the if len(na_dict_initial.keys()) > 0: check should be removed, so that the code becomes:

df.drop([a + '_na' for a in list(set(na_dict.keys()) - set(na_dict_initial.keys()))], axis=1, inplace=True)
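
A small sketch of what the guard does, using plain dictionaries in place of the real dataframes. The function extra_na_columns and the sample values are made up for illustration; only the set expression comes from the snippet above.

```python
def extra_na_columns(na_dict_initial, na_dict, guarded):
    """Return the *_na columns the drop step would remove.

    na_dict_initial: the dict passed in (built from the training set)
    na_dict:         the dict after processing the new data
    guarded:         True reproduces the `if len(...) > 0` check
    """
    if guarded and len(na_dict_initial.keys()) == 0:
        return []  # the guard skips the drop entirely
    return sorted(a + '_na' for a in set(na_dict.keys()) - set(na_dict_initial.keys()))

# Training set had no missing values, test set is missing GarageArea.
train_nas = {}
test_nas = {'GarageArea': 450.0}

print(extra_na_columns(train_nas, test_nas, guarded=True))   # [] -> GarageArea_na is kept
print(extra_na_columns(train_nas, test_nas, guarded=False))  # ['GarageArea_na'] -> dropped

# Same logic on a first pass over a training set that does have gaps.
print(extra_na_columns({}, {'LotFrontage': 70.0}, guarded=False))  # ['LotFrontage_na']
```

The third call is worth a test case of its own: if proc_df() also represents "no na_dict supplied" as an empty dict (an assumption, not checked against structured.py), dropping unconditionally would remove the _na columns on a first pass over the training set too.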

Do you agree?

@pierreguillou maybe you could try to create a few test cases for the different ways this function should work, and then submit a PR with those tests and corrected code?

Hi @jeremy. You’re right, I need to write more test cases before submitting a PR. I used the proc_df() function today and it worked well.