Train_cats vs get_dummies

number007 · October 11, 2018, 10:37am

I am creating a separate thread for a question that originally came from the wiki for Lesson 1, as I think it deserves its own discussion. Reproducing the question as-is:

While looking at the pandas documentation, I see a method called “get_dummies”:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

which can convert categorical values to indicator/dummy variables.

I ran it on the bulldozer dataset and the output is similar to “one hot encoding”.

So, I am wondering: which is the better method of the two? Using train_cats to extract category codes, or using get_dummies?
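To make the two options in the question concrete, here is a minimal pandas-only sketch. The column name and values are made up for illustration and are not taken from the thread, and the category codes are produced with plain pandas rather than with train_cats itself.

```python
import pandas as pd

# Toy stand-in for one categorical column; the name and values are illustrative only.
df = pd.DataFrame({"UsageBand": ["Low", "High", "Medium", "Low"]})

# Option 1: integer category codes. One column; categories are sorted
# alphabetically (High=0, Low=1, Medium=2).
codes = df["UsageBand"].astype("category").cat.codes

# Option 2: one indicator (dummy) column per category.
dummies = pd.get_dummies(df["UsageBand"], prefix="UsageBand")

print(codes.tolist())            # [1, 0, 2, 1]
print(dummies.columns.tolist())  # ['UsageBand_High', 'UsageBand_Low', 'UsageBand_Medium']
```

The codes keep the data in a single column but impose an arbitrary ordering on the values, while the dummies add one column per category and impose no ordering.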

kva4 · October 11, 2018, 11:27am

Hi!
The ‘train_cats’ method turns ‘string’ type columns into ‘category’ type columns. After converting, you can pass these categorical columns to the ‘get_dummies’ method, which by default encodes columns of object or category dtype. I guess we can use both, because they do different jobs.
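A rough sketch of the two-step flow described here, using only pandas. The helper `to_categories` is a hypothetical stand-in for train_cats written for this example, not the fastai implementation, and the data is invented.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def to_categories(df):
    """Hypothetical stand-in for train_cats: convert string columns to
    ordered 'category' columns, modifying the frame in place."""
    for name, col in df.items():
        if is_string_dtype(col):
            df[name] = col.astype("category").cat.as_ordered()


# Invented example data: one string column, one numeric column.
df = pd.DataFrame({"state": ["Ohio", "Texas", "Ohio"], "year": [2004, 2006, 2011]})

to_categories(df)              # step 1: strings -> category dtype
encoded = pd.get_dummies(df)   # step 2: category columns -> indicator columns

print(df["state"].dtype)       # category
print(list(encoded.columns))   # ['year', 'state_Ohio', 'state_Texas']
```

The numeric column passes through untouched; only the categorical column is expanded.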

marcmuc · October 11, 2018, 1:40pm

There are different reasons for using either one-hot encoding or just numericalizing a column. Some of the considerations are discussed in the ML course; at least lessons 1 and 2 would be interesting to you, as they deal with the entire preparation of the same dataset.

Btw, the fastai proc_df function has a parameter max_n_cat that uses pd.get_dummies under the hood, so you can do that within that processing step if needed.
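As an illustration of how such a cardinality threshold might work, here is a pandas-only sketch. The function `encode` and its logic are assumptions made for this example (columns with more categories than the threshold get integer codes, the rest are one-hot encoded); it is not the fastai source, so check proc_df itself for the exact behaviour.

```python
import pandas as pd


def encode(df, max_n_cat=None):
    """Hypothetical sketch of a thresholded encoder (not the fastai code).

    Category columns with more than `max_n_cat` levels become integer codes;
    the remaining category columns are one-hot encoded by pd.get_dummies.
    With max_n_cat=None every category column becomes integer codes.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            if max_n_cat is None or len(col.cat.categories) > max_n_cat:
                # Shift codes by one so that 0 stands for missing values.
                out[name] = col.cat.codes + 1
    # Whatever is still categorical gets indicator columns, plus one for NaN.
    return pd.get_dummies(out, dummy_na=True)


# Invented data: 'model' has 4 levels, 'band' has 2.
df = pd.DataFrame({
    "model": pd.Categorical(["a", "b", "c", "d"]),
    "band": pd.Categorical(["Low", "High", "Low", "High"]),
})

encoded = encode(df, max_n_cat=3)
print(encoded.columns.tolist())
```

With a threshold of 3, `model` stays a single integer column while `band` is expanded into indicator columns.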

pnvijay · October 11, 2018, 2:45pm

Hi @marcmuc, are you saying that, since proc_df uses pd.get_dummies under the hood, we can just use proc_df and a separate call to pd.get_dummies is not required? Can you please clarify?

marcmuc · October 11, 2018, 3:12pm

I just meant that you could use the proc_df function (which you will use anyway if you follow the course) to also produce the one-hot encoded columns. You do not have to do that manually using pandas. But of course you could.

pnvijay · October 11, 2018, 3:18pm

Understood, thanks! I did not know that. I will explore the max_n_cat parameter to see how it enables one-hot encoding.

number007 · October 12, 2018, 4:25am

I might have missed it: what were the considerations for numericalizing a column vs. one-hot encoding it?

rook · November 4, 2018, 4:04pm

There is a really good explanation of this here:

https://forums.fast.ai/t/to-label-encode-or-one-hot-encode/6057

number007 · November 19, 2018, 6:23am

After much experimentation, I arrived at the same conclusion as that thread: it is better to use one-hot encoding most of the time.