Categorify: Is One Hot Encoding never needed (even on names)?

Patrick · September 9, 2020, 3:12am

If a user has multiple observations associated with them, then that absolutely could be useful information for the model. But if a user only has a single observation associated with them, then there is nothing for the network to learn about such a user. So, when you write “…to uniquely identify each user…”, I interpreted that as potentially meaning a single observation for each user. Apologies if that was an incorrect assumption.

To your question about one-hot encoding: the models in fastai will learn an embedding for each category. You can think of each dimension of the embedding as a latent, hopefully useful feature that distinguishes the categories in a way that is pertinent to the purpose of your model (which is to minimize the loss). Generally, if you have more than a couple of categories, the (entity) embedding technique will offer more flexibility, and potentially better performance, than one-hot encoding.
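
To make the relationship concrete, here is a minimal sketch in plain Python (not fastai's actual implementation) showing that looking up a row in an embedding matrix gives the same result as multiplying a one-hot vector by that matrix. The category names, the embedding size of 2, and the helper functions `one_hot` and `matvec` are all made up for illustration; in a real model the matrix entries would be trained weights rather than random numbers.

```python
import random

def one_hot(index, n_categories):
    """Return a one-hot row vector (list of floats) for a category index."""
    return [1.0 if i == index else 0.0 for i in range(n_categories)]

def matvec(row, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    n_cols = len(matrix[0])
    return [
        sum(row[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]

random.seed(0)
categories = ["alice", "bob", "carol", "dave"]  # hypothetical user names
emb_dim = 2  # illustrative size, chosen arbitrarily

# One row of emb_dim weights per category (random here, learned in a real model).
embedding = [[random.gauss(0, 1) for _ in range(emb_dim)] for _ in categories]

idx = categories.index("carol")
via_one_hot = matvec(one_hot(idx, len(categories)), embedding)  # one-hot times matrix
via_lookup = embedding[idx]                                     # direct row lookup

print(via_one_hot == via_lookup)
```

So an embedding can be read as a one-hot encoding followed by a learned linear layer, just computed as an index lookup. The difference is that each category ends up as a short dense vector (2 numbers here) instead of a sparse vector as long as the number of categories.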