I am working on the Data Science Bowl 2019 competition. The dataset has a column called installation_id. It is a string that contains both digits and letters, e.g. 0006a69f, 0006c192, etc. When I try to fit this dataset with sklearn’s DecisionTreeClassifier, I get the following error:
could not convert string to float: '0006a69f'.
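For reference, a minimal sketch that should reproduce the error on toy data. Only the two IDs come from the real dataset; the `some_numeric_feature` column, the labels, and the frame itself are made up for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: the IDs are real examples, everything else is hypothetical.
X = pd.DataFrame({
    "installation_id": ["0006a69f", "0006c192", "0006a69f", "0006c192"],
    "some_numeric_feature": [10.0, 25.0, 40.0, 5.0],  # hypothetical column
})
y = [0, 1, 0, 1]  # hypothetical labels

# The tree needs a numeric array, so the string column makes fit() fail.
try:
    DecisionTreeClassifier().fit(X, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```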
This gave me the idea of converting this column into categorical data. The installation_id column has 3614 unique values, so I thought of one-hot encoding it. But this would create a lot of columns, i.e. 3614, one per unique value.
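To make the trade-off concrete, here is a minimal sketch of the two encodings I am weighing, on a made-up four-row frame. The ID `00129856` is invented for the example, and `OrdinalEncoder` is just one alternative I am aware of, not necessarily the right choice here. One-hot encoding yields one column per unique ID, while ordinal encoding keeps a single column but assigns integer codes whose order is arbitrary.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame: "00129856" is a made-up ID added for illustration.
ids = pd.DataFrame(
    {"installation_id": ["0006a69f", "0006c192", "0006a69f", "00129856"]}
)

# One-hot encoding: one new column per unique value.
one_hot = pd.get_dummies(ids["installation_id"], prefix="installation_id")
n_unique = ids["installation_id"].nunique()
print(one_hot.shape, n_unique)

# Ordinal encoding: a single numeric column of integer codes.
# The codes follow sorted string order, which carries no real meaning.
encoder = OrdinalEncoder()
ordinal = encoder.fit_transform(ids[["installation_id"]])
print(ordinal.shape)
```

On the real column, the one-hot frame would have 3614 columns, whereas the ordinal version would stay at one.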
Can anyone suggest what should be done in this case? Is one-hot encoding the better option, or are there other methods to convert the column into numerical categorical data?