Dealing with missing categorical values

ksksingh022 · May 3, 2020, 7:21am

In the Blue Book for Bulldozers dataset, I am unable to understand how we are dealing with the values that are missing in the categorical columns. Can anyone explain this to me?

ksksingh022 · May 9, 2020, 7:48am

Can anyone help, please?

FraPochetti · May 9, 2020, 9:03am

Can you link the exact notebook you are referring to, please?
In any case, imputing missing values in categorical features generally defaults to replacing NaNs with a specific string such as "missing", and then going on with the variable encoding.
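A minimal pandas sketch of that approach, in case it helps. The column name and values are made up for illustration, and this is plain pandas, not what any particular notebook does:

```python
import numpy as np
import pandas as pd

# Toy frame; "UsageBand" and its values are illustrative only.
df = pd.DataFrame({"UsageBand": ["Low", np.nan, "High", "Medium", np.nan]})

# Step 1: turn NaNs into an explicit category of their own.
df["UsageBand"] = df["UsageBand"].fillna("missing")

# Step 2: carry on with the encoding; here, simple integer codes.
df["UsageBand_code"] = df["UsageBand"].astype("category").cat.codes

print(df)
```

After this, "missing" is just one more level, so the model can learn whether missingness itself is informative.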

krithi07 · May 11, 2020, 3:24pm

@ksksingh022 All the string (non-numeric) columns are first converted to pandas categoricals using train_cats, which stores them as integer codes under the hood. After this, we pass the dataframe to the proc_df function, which deals with missing values using the fix_missing function. fix_missing essentially replaces NAs in numeric columns with the median of that column.
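Roughly, the median fill looks like this in plain pandas. This is a sketch of my reading of the fastai source, not the library code itself; the column name is made up, and the extra "_na" flag column mirrors what I understand fix_missing to add:

```python
import numpy as np
import pandas as pd

# Toy numeric column; "MachineHours" is an illustrative name.
df = pd.DataFrame({"MachineHours": [100.0, np.nan, 300.0, 500.0]})

# Remember which rows were missing before filling them in.
df["MachineHours_na"] = df["MachineHours"].isna()

# The median is computed over the non-missing values only.
median = df["MachineHours"].median()
df["MachineHours"] = df["MachineHours"].fillna(median)

print(df)
```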

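For a categorical variable, what ends up in place of a missing value depends on how the categories get coded, and as far as I can tell from the source it is not the median: fix_missing skips non-numeric columns. By default, proc_df then swaps each category for its integer code plus one. pandas gives a missing value the code -1, so every NA ends up as 0, a value of its own. If your categories are ordered Low, Medium, High, they become 1, 2, 3 and the NAs become 0. If the order is Medium, Low, High, then Medium is 1 and Low is 2, but the NAs are still 0.

For the simplest substitution, it is common practice to replace missing values in a categorical variable with the category that appears most often, mode(cat_variable). proc_df will not do that for you, so if you want it, fill the column yourself before calling proc_df.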
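Here is a small pandas sketch of both ideas: how the category order drives the integer codes (with missing values landing on 0 once the codes are shifted by one), and how to do a mode fill by hand. The data and the category order are made up, and the +1 shift is my reading of the fastai source rather than a call into the library:

```python
import numpy as np
import pandas as pd

# Toy column; the values and the category order are illustrative.
usage = pd.Series(["Low", "High", np.nan, "Medium", "Low"])

# Fix the category order explicitly.
cat = pd.Categorical(usage, categories=["Low", "Medium", "High"], ordered=True)

# pandas codes are 0-based, with -1 for missing; adding 1 sends missing to 0.
codes = cat.codes + 1
print(codes.tolist())

# Mode imputation is done by hand, before any numeric encoding.
most_common = usage.mode()[0]
filled = usage.fillna(most_common)
print(filled.tolist())
```

With this order, Low, Medium, High map to 1, 2, 3 and the missing row maps to 0. The mode fill turns the missing row into Low, the most frequent level in this toy column.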

P.S. The source code for the fastai functions is pretty readable and helps answer questions like these.