Train_cats vs get_dummies


#1

I am creating a separate thread for a question originally posted on the wiki for Lesson 1, as I think it deserves its own discussion. Reproducing the question as-is:

While looking at the pandas documentation, I see a method called “get_dummies”:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

which can convert categorical values to indicator/dummy variables.

I ran it on the bulldozer dataset and the output is similar to “one hot encoding”.

So, I am wondering - which is a better method out of the two? Using train_cats to extract category codes or using get_dummies?
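
For reference, here is a minimal sketch of the two encodings side by side. The column name and values are made up for illustration (this is not the bulldozer data), and the integer codes are taken straight from pandas rather than from fastai:

```python
import pandas as pd

# Toy data; the column name and values are invented for this example.
df = pd.DataFrame({"UsageBand": ["Low", "High", "Medium", "Low"]})

# Option 1: one integer code per category (categories are sorted alphabetically).
codes = df["UsageBand"].astype("category").cat.codes

# Option 2: one indicator column per category.
dummies = pd.get_dummies(df["UsageBand"], prefix="UsageBand")

print(codes.tolist())         # [1, 0, 2, 1]
print(list(dummies.columns))  # ['UsageBand_High', 'UsageBand_Low', 'UsageBand_Medium']
```

The first option keeps a single column; the second adds one column for every distinct value.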


(Oleksandr Bratchyk) #2

Hi!
The ‘train_cats’ method turns ‘string’ type columns into ‘category’ type columns. After converting, you can pass these categorical columns to the ‘get_dummies’ method. Strictly speaking, ‘get_dummies’ also accepts plain string (object) columns, but the two methods do different work, so I guess it makes sense to use both.
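
A rough pandas-only sketch of that two-step flow, assuming train_cats does little more than convert string columns to ordered categoricals. The helper name `to_categories` and the toy data are hypothetical, not fastai code:

```python
import pandas as pd
from pandas.api.types import is_string_dtype

def to_categories(df):
    """Hypothetical stand-in for train_cats: convert string columns to ordered categoricals, in place."""
    for name in list(df.columns):
        if is_string_dtype(df[name]):
            df[name] = df[name].astype("category").cat.as_ordered()

# Toy data invented for this example.
df = pd.DataFrame({"state": ["Ohio", "Texas", "Ohio"], "price": [10.0, 12.5, 9.0]})

to_categories(df)              # step 1: strings -> category dtype
encoded = pd.get_dummies(df)   # step 2: category columns -> indicator columns

print(df.dtypes)
print(list(encoded.columns))
```

The numeric `price` column passes through untouched; only `state` is expanded into indicator columns.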


(Marc Rostock) #3

There are different reasons for using either one-hot encoding or just numericalizing a column. Some of the considerations are discussed in the ML course; at least lessons 1 and 2 would be interesting to you, as they deal with the entire preparation of the same dataset.

Btw, the fastai proc_df function has a parameter max_n_cat that controls which categorical columns are passed to pd.get_dummies under the hood, so you can do the one-hot encoding within that processing step if needed.
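
To make the idea concrete, here is a pandas-only approximation of a cardinality threshold like the one above. It assumes that columns with more categories than the threshold are replaced by integer codes and the remaining categorical columns are one-hot encoded. The helper `encode` and its `max_categories` argument are hypothetical names, and this is not the fastai implementation:

```python
import pandas as pd

def encode(df, max_categories):
    """Hypothetical helper: integer codes for high-cardinality columns, dummies for the rest."""
    out = df.copy()
    for name in list(out.columns):
        col = out[name]
        is_cat = isinstance(col.dtype, pd.CategoricalDtype)
        if is_cat and len(col.cat.categories) > max_categories:
            # Shift by one so a missing value (code -1) would map to 0.
            out[name] = col.cat.codes + 1
    # Any column still of category dtype is expanded into indicator columns.
    return pd.get_dummies(out)

# Toy data invented for this example: "model" has 4 categories, "size" has 2.
df = pd.DataFrame({
    "model": ["a", "b", "c", "d", "a"],
    "size": ["S", "L", "S", "L", "S"],
}).astype("category")

result = encode(df, max_categories=3)
print(list(result.columns))  # ['model', 'size_L', 'size_S']
```

With a threshold of 3, `model` stays a single integer column while `size` becomes two indicator columns.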


(Vijay Narayanan Parakimeethal) #4

Hi @marcmuc, are you saying that since proc_df uses pd.get_dummies under the hood, we can just use proc_df and a separate call to pd.get_dummies is not required? Can you please clarify?


(Marc Rostock) #5

I just meant that you could use the proc_df function (which you will use anyway if you follow the course) to also produce the one-hot encoded columns. You do not have to do that manually using pandas, but of course you could.


(Vijay Narayanan Parakimeethal) #6

Understood, thanks! I did not know that. Will explore the max_n_cat parameter to see how it enables one-hot encoding.


#7

I might have missed it, but what were the considerations for numericalizing a column vs. one-hot encoding it?