One-hot encoding for a large number of categorical values?

keyurparalkar · February 5, 2020, 5:01pm

Hello Everyone,

I am working on the Data Science Bowl 2019 competition. The dataset has a column called installation_id. It is a string consisting of both digits and letters, e.g. 0006a69f, 0006c192, etc. When I try to fit this dataset with sklearn’s DecisionTreeClassifier, I get the following error:

could not convert string to float: '0006a69f'.

This gave me the idea of converting this column into categorical data. The column installation_id has 3614 unique values, so I thought of converting it via one-hot encoding. But this would create a lot of columns, i.e. 3614.
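For scale, here is a minimal sketch of what one-hot encoding such a column looks like with sklearn's OneHotEncoder. The IDs below are made-up stand-ins for installation_id, not real competition data. By default the encoder returns a SciPy sparse matrix, so although there is one column per unique ID, only one value per row is actually stored.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the installation_id column (made-up values).
ids = np.array(
    ["0006a69f", "0006c192", "0006a69f", "00129856", "0006c192"]
).reshape(-1, 1)  # the encoder expects a 2D array: (n_samples, n_features)

# The default output is a SciPy sparse matrix, not a dense array.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(ids)

n_unique = len(encoder.categories_[0])  # one output column per unique ID
print(one_hot.shape)   # (number of rows, number of unique IDs)
print(one_hot.nnz)     # stored non-zero entries: one per row
```

With 3614 unique IDs the matrix would have 3614 columns, but the memory cost stays proportional to the number of rows as long as it is kept sparse.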

Can anyone suggest what should be done in this case? Is one-hot encoding the better option, or are there other methods to convert it into numerical categorical data?

Thanks.

piby4 · February 6, 2020, 3:29am

If the problem is that OHE creates a huge sparse matrix, you can use
sklearn.preprocessing.LabelEncoder

This maps each category to an integer (essentially a lookup table), so the values stay in one column as numeric codes.
Some algorithms (e.g. xgboost) only take numeric values for categorical variables, so this method might be useful.
Other algorithms (e.g. catboost, lightgbm) can handle categorical variables given as numeric or string values.
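A minimal sketch of the LabelEncoder approach, using made-up IDs as a stand-in for installation_id. Two caveats: the integer codes follow sorted order of the strings, so they imply an ordering that has no real meaning, and the sklearn documentation describes LabelEncoder as intended for target labels, with OrdinalEncoder as the equivalent for feature columns.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the installation_id column (made-up values).
ids = ["0006a69f", "0006c192", "0006a69f", "00129856"]

encoder = LabelEncoder()
codes = encoder.fit_transform(ids)  # one integer per row, same column count

# classes_ holds the sorted unique IDs; a code is an index into it.
print(encoder.classes_)
print(codes)

# The mapping is reversible, so the original IDs can be recovered.
original = encoder.inverse_transform(codes)
```

The result is a single numeric column instead of one column per ID, which is enough to get past the "could not convert string to float" error.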