I am creating a separate thread for a question originally posted on the wiki for Lesson 1, as I think it deserves its own discussion. Reproducing the question as-is:
While looking at the pandas documentation, I see a method called “get_dummies”:
Hi!
The ‘train_cats’ method turns ‘string’ type columns into ‘category’ type columns. After converting, you can pass these categorical columns to the ‘get_dummies’ method, which expands each one into one-hot (indicator) columns. Strictly speaking, ‘get_dummies’ does not require ‘category’ type input, since by default it also encodes object/string columns. Still, the two do different work, so I guess it can make sense to use both.
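To make the two steps concrete, here is a minimal sketch in plain pandas. The toy DataFrame and its column names are made up, and `astype("category")` is only standing in for what train_cats does to string columns; it is not the fastai code itself.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "price": [10, 12, 9, 15],
})

# Stand-in for train_cats: turn the string column into a pandas 'category' column.
df["color"] = df["color"].astype("category")

# get_dummies expands the categorical column into one indicator column per level.
one_hot = pd.get_dummies(df, columns=["color"])

print(one_hot.columns.tolist())
```

The numeric column passes through untouched, and each level of the categorical column gets its own indicator column.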
There are different reasons for using either one-hot encoding or just numericalizing a column. Some of the considerations are discussed in the ML course. At least lessons 1 and 2 would be interesting to you, as they deal with the entire preparation of the same dataset.
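Btw, the fastai proc_df function has a parameter max_n_cat that uses pd.get_dummies under the hood: categorical columns with no more than max_n_cat levels are one-hot encoded instead of being replaced by integer codes. So you can do that within that processing step if needed.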
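For anyone curious what that threshold amounts to, here is a rough sketch in plain pandas. This is not the fastai source: `encode_categoricals` is a hypothetical helper written for illustration, and the toy data is made up. It assumes the behaviour described above, i.e. categorical columns with more levels than the threshold become integer codes, and the remaining ones go through pd.get_dummies.

```python
import pandas as pd

def encode_categoricals(df, max_n_cat):
    """Hypothetical helper: integer-code high-cardinality categoricals, one-hot the rest."""
    df = df.copy()
    for name in df.columns:
        col = df[name]
        if col.dtype.name == "category" and len(col.cat.categories) > max_n_cat:
            # cat.codes uses -1 for missing values, so +1 maps missing to 0.
            df[name] = col.cat.codes + 1
    # Any column still of category dtype is expanded into indicator columns.
    return pd.get_dummies(df, dummy_na=True)

# Toy data, invented for illustration.
df = pd.DataFrame({
    "state": ["CA", "NY", "TX", "WA", "CA"],      # 4 levels
    "band": ["low", "high", "low", "high", "low"],  # 2 levels
    "price": [1, 2, 3, 4, 5],
})
df["state"] = df["state"].astype("category")
df["band"] = df["band"].astype("category")

out = encode_categoricals(df, max_n_cat=3)
print(out.columns.tolist())
```

With a threshold of 3, `state` (4 levels) is numericalized and `band` (2 levels) is one-hot encoded, so you get both treatments from a single call.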
Hi @marcmuc, are you saying that since proc_df uses pd.get_dummies under the hood, we can just use proc_df and calling pd.get_dummies ourselves is not required? Can you please clarify?
I just meant that you could use the proc_df function (which you will use anyways if you follow the course) to also produce the one hot encoded columns. You do not have to do that manually using pandas. But of course you could.