Confusion about "proc_df" in lesson 3 rossman example

wywgong · March 29, 2018, 2:00pm

I’ve just run through the lesson3-rossman source code to understand what each line does. But after reading the source of “proc_df” (which resides in fastai/structured.py) and running a dataframe through it, I am confused by the code below: I don’t see why we need “pd.get_dummies” here.

>     if do_scale: mapper = scale_vars(df, mapper)
>     for n,c in df.items(): numericalize(df, c, n, max_n_cat)
>     res = [pd.get_dummies(df, dummy_na=True), y, na_dict]

The “scale_vars” function applies StandardScaler to all the numeric columns.
The “numericalize” function turns all the categorical columns into integer category codes (numeric types).
So after these two functions, all the columns in “df” have become numeric. My question is: why do we need to call pd.get_dummies(df, dummy_na=True) to one-hot encode “df” when it no longer has any categorical columns, which would mean “get_dummies” has no effect at all? Is this redundant code, or is there some scenario I haven’t figured out? Thanks

bilalUWE · December 19, 2018, 11:06am

As far as I understand so far:

scale_vars() (called when do_scale is set) standardizes the numerical columns in the df
numericalize() replaces labels with their internal integer codes for the category columns
get_dummies one-hot encodes any string or category columns that still remain in the df
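
To illustrate the last point, here is a minimal sketch of how pd.get_dummies treats numeric versus category columns. The toy frame and its column names (sales, store_type) are made up for illustration and are not from the Rossmann data.

```python
import pandas as pd

# Toy frame (hypothetical columns): one numeric column and one column
# that still has category dtype.
df = pd.DataFrame({
    "sales": [1.0, 2.5, 3.0],
    "store_type": pd.Categorical(["a", "b", "a"]),
})

# All-numeric input: get_dummies has nothing to encode and keeps the column.
numeric_only = pd.get_dummies(df[["sales"]], dummy_na=True)

# Mixed input: only the category column is expanded into indicator columns;
# dummy_na=True adds one more indicator for missing values.
mixed = pd.get_dummies(df, dummy_na=True)

print(list(numeric_only.columns))
print(list(mixed.columns))
```

So if every column were already numeric, the call would indeed be a no-op; it only matters when some category (or string) columns are still present.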

diegobrito · September 5, 2019, 7:38pm

Notice the importance of the max_n_cat parameter in the numericalize function:

if not is_numeric_dtype(col) and ( max_n_cat is None or len(col.cat.categories)>max_n_cat):
    df[name] = pd.Categorical(col).codes+1

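It determines which categorical columns are converted to numerical columns: only non-numeric columns with more than max_n_cat categories (or all of them when max_n_cat is None) are replaced by integer codes. The columns unaffected by this transformation will be transformed by the pandas get_dummies method.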
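
A small sketch of that interaction, assuming a simplified stand-in for numericalize (named numericalize_sketch here, not the fastai function itself) built around the condition quoted above. The columns store_id and promo and the value max_n_cat=3 are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def numericalize_sketch(df, col, name, max_n_cat):
    # Simplified stand-in using the condition quoted above: a non-numeric
    # column becomes integer codes only if max_n_cat is None or the column
    # has more than max_n_cat categories.
    if not is_numeric_dtype(col) and (
        max_n_cat is None or len(col.cat.categories) > max_n_cat
    ):
        df[name] = pd.Categorical(col).codes + 1

# Hypothetical data: store_id has 4 categories, promo has 2.
df = pd.DataFrame({
    "store_id": pd.Categorical(["s1", "s2", "s3", "s4"]),
    "promo": pd.Categorical(["yes", "no", "yes", "no"]),
})

# Snapshot the columns first so we do not mutate df while iterating it.
for n, c in list(df.items()):
    numericalize_sketch(df, c, n, max_n_cat=3)

# store_id (4 > 3 categories) is now integer codes; promo (2 <= 3) is
# still a category column, so get_dummies one-hot encodes it.
result = pd.get_dummies(df, dummy_na=True)
print(result.columns.tolist())
```

With max_n_cat=None every categorical column would be turned into codes, and get_dummies would have nothing left to encode, which matches the situation described in the original question.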