Categorify: Is One Hot Encoding never needed (even on names)?

bclc0597 · September 6, 2020, 6:00pm

Hey guys, this is my first time posting on this forum. I apologize if this question has been answered before (not that I'm aware of). The question is as follows:

tabular.Categorify encodes categories as integer labels (0, 1, 2, …) instead of using one-hot encoding. My question: for a feature that contains a user’s name (in a recommendation system), would it be more appropriate to one-hot encode this column instead? The only purpose of including this feature in the dataset is to uniquely identify each user (assuming all users have distinct names), so that the model knows which user each row of history/data belongs to.

In this case, would labelling each name 0,1,2… by Categorify not cause problems?

Patrick · September 6, 2020, 6:54pm

Hi Bill,

If a categorical variable has such high cardinality that every level is (nearly) unique, there is nothing for a model (neural network or otherwise) to learn from that variable. With one datum per category, the model couldn’t even compute a meaningful summary statistic of the response variable, such as the mean or variance, for that category, so how could it find meaningful model parameters to apply to that category? You don’t want to keep identification, or ID, variables in the input data to any model. Those should be set aside before the data is fed into the model and joined back onto the output later.
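To make "set aside and join back" concrete, here is a minimal sketch in plain Python. The rows, the column meanings, and `toy_model` are all made up for illustration; the stand-in function is not a fastai API, it only marks the place where a trained model would go.

```python
# Hypothetical toy rows: (user_name, hour_of_day, n_clicks).
rows = [
    ("alice", 9, 3),
    ("bob", 14, 7),
    ("carol", 21, 1),
]

# Set the ID column aside, keeping row order so it can be re-attached later.
ids = [name for name, *_ in rows]
features = [list(rest) for _, *rest in rows]


def toy_model(x):
    """Stand-in for a trained model; it only ever sees the features."""
    hour, n_clicks = x
    return 0.1 * hour + 0.5 * n_clicks


predictions = [toy_model(x) for x in features]

# Join the IDs back onto the predictions by position.
results = dict(zip(ids, predictions))
print(results)
```

The model never receives the name, but every prediction can still be traced back to its user because the row order is preserved.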

Best!

bclc0597 · September 7, 2020, 4:55am

Thanks for the reply Patrick! I do not entirely understand so I would like to ask a bit more if I may:

What do you mean by setting the ID aside and joining it back later? Also, I'm curious: if a unique ID won't be of any help to models, how can I let the model “recognize” certain users? For example, say each entry of data represents one user activity, and I am trying to let the model “recognize” activities belonging to a particular user in the training data.

Is this not possible? Or is my understanding of deep learning just terribly out of touch?

Thank you!

Patrick · September 9, 2020, 3:12am

If a user has multiple observations associated with them, then that absolutely could be useful information for the model. But if a user only has a single observation associated with them, then there is nothing for the network to learn about such a user. So, when you write “…to uniquely identify each user…”, I interpreted that as potentially meaning a single observation for each user. Apologies if that was an incorrect assumption.
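A quick way to tell which of the two situations you are in is to count the observations per user. This is just a sketch with a made-up `user_column`; in practice you would run the same count on your own name column.

```python
from collections import Counter

# Hypothetical user column from an activity log (one entry per row).
user_column = ["alice", "bob", "alice", "carol", "alice", "bob"]

counts = Counter(user_column)

# Users that appear exactly once give the model nothing to generalize from.
singletons = [user for user, n in counts.items() if n == 1]
share_singletons = len(singletons) / len(counts)

print(counts)
print(singletons, share_singletons)
```

If most users are singletons, the column behaves like a pure ID; if most users have several rows, it can carry useful signal.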

To your question about one-hot encoding, the models in fastai will learn an embedding for each category. You can think of each dimension of the embedding as some unobservable but hopefully useful feature which distinguishes the categories in a way that is pertinent to the purpose of your model (which is to minimize the loss). Generally, if you have more than a couple of categories, the (entity) embedding technique will offer more flexibility than one-hot encoding and potentially better performance.
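On the worry that labels 0, 1, 2… might cause problems: in an embedding layer the integer is only used as a row index into a weight matrix, which gives the same result as multiplying a one-hot vector by that matrix. The sketch below shows this in plain Python with random weights standing in for learned ones. The names, the codes, and the embedding size are illustrative assumptions, not what Categorify or fastai would actually produce.

```python
import random

random.seed(0)

categories = ["alice", "bob", "carol"]
# Integer codes, as a label encoder might assign them (illustrative only).
code = {name: i for i, name in enumerate(categories)}

emb_dim = 2
# One row of weights per category; in a real model these are learned.
weights = [[random.gauss(0, 1) for _ in range(emb_dim)] for _ in categories]


def embed(name):
    """Embedding lookup: the integer code just selects a row."""
    return weights[code[name]]


def one_hot_times_weights(name):
    """The same result via an explicit one-hot vector and a matrix product."""
    one_hot = [1.0 if i == code[name] else 0.0 for i in range(len(categories))]
    return [
        sum(one_hot[i] * weights[i][d] for i in range(len(categories)))
        for d in range(emb_dim)
    ]


for name in categories:
    print(name, embed(name), one_hot_times_weights(name))
```

Because the code never enters any arithmetic as a number, "carol = 2" is not treated as larger than "alice = 0"; the ordering of the labels is arbitrary.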