Is one hot encoding not required in fastai?

Hello.

So I’m trying to figure out how to one-hot encode in fastai. I couldn’t figure out how, so I referred back to a chapter in the fastbook and came across this snippet.

In the previous chapter, when working with deep learning networks, we dealt with categorical variables by one-hot encoding them and feeding them to an embedding layer. The embedding layer helped the model to discover the meaning of the different levels of these variables (the levels of a categorical variable do not have an intrinsic meaning, unless we manually specify an ordering using Pandas). In a decision tree, we don’t have embeddings layers—so how can these untreated categorical variables do anything useful in a decision tree? For instance, how could something like a product code be used?
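For context, by "one-hot encoding" I mean the usual indicator-column trick, something like this minimal pandas sketch (the column name and values are just made up for illustration):

```python
import pandas as pd

# Toy frame with a single categorical column (invented data)
df = pd.DataFrame({"product_code": ["A12", "B07", "A12", "C33"]})

# One-hot encode: one 0/1 indicator column per distinct level
one_hot = pd.get_dummies(df["product_code"], prefix="product_code")
print(one_hot.columns.tolist())
# three distinct levels -> three indicator columns
```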


Edit: Specifically this snippet from the quote above.

The embedding layer helped the model to discover the meaning of the different levels of these variables (the levels of a categorical variable do not have an intrinsic meaning, unless we manually specify an ordering using Pandas).
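For what it’s worth, I think the "manually specify an ordering using Pandas" part refers to ordered categoricals, along these lines (a sketch; the column and levels are invented):

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium"])

# Without an explicit order the levels have no intrinsic meaning;
# with ordered=True, pandas assigns codes that follow the given ranking.
ordered = pd.Categorical(
    sizes, categories=["small", "medium", "large"], ordered=True
)
print(list(ordered.codes))  # codes follow the order: small=0, medium=1, large=2
```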


Does this mean that one-hot encoding is not required in fastai? When I create a Learner and inspect its model, I see that it has embedding layers.

Hey,

From my understanding, fastai handles categorical variables for you by creating embedding layers, as long as you nominate those variables as categorical. So if I understand your question the way I think you mean it: yes, one-hot encoding isn’t ‘required’, since the library handles these categorical features for you once you nominate them.

But that doesn’t mean you can’t, or shouldn’t; like everything, it probably depends on the situation. That’s my naive understanding at least, hope that helps :smile:


Thanks for the response! It does help! Wanted to confirm my understanding.

By nominating them, you mean that I categorify those variables before creating the learner, right?

In any case, I think your understanding is right. I stumbled across this post a few days ago.

The second paragraph says:

To your question about one-hot encoding, the models in fastai will learn an embedding for each category. You can think of the dimensions of the embedding as some unobservable but hopefully useful dimension which distinguishes the categories in a way that is pertinent to the purpose of your model (which is to minimize the loss). Generally, if you have more than a couple categories, the (entity) embedding technique will offer more flexibility and potentially superior performance than one-hot encoding.
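To make that connection concrete: an embedding layer is mathematically just a one-hot multiplication done efficiently as a row lookup. Here is a small numpy sketch (the matrix values are arbitrary; in a real model they would be learned):

```python
import numpy as np

n_categories, emb_dim = 4, 3
rng = np.random.default_rng(0)
emb_matrix = rng.normal(size=(n_categories, emb_dim))  # learned in practice

category = 2                      # integer code for one level
one_hot = np.zeros(n_categories)
one_hot[category] = 1.0

# Multiplying the one-hot vector by the matrix selects exactly one row,
# which is why an embedding lookup can replace explicit one-hot inputs.
via_one_hot = one_hot @ emb_matrix
via_lookup = emb_matrix[category]
print(np.allclose(via_one_hot, via_lookup))
```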

So fastai most likely does handle categorical variables by creating embedding layers.

From what I’ve read, embedding layers generally work better than one-hot encoding once you have more than a couple of categories, which is presumably why fastai defaults to them. The paragraph above makes the same point.
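One practical advantage is dimensionality: a one-hot input needs one column per level, while the embedding is much narrower. If I remember right, fastai picks embedding widths with a heuristic roughly like min(600, round(1.6 * n**0.56)); the sketch below mirrors that rule as I understand it, but check the fastai source (`emb_sz_rule`) for the authoritative version:

```python
# Rough comparison of representation sizes for a categorical variable.
# NOTE: this mirrors fastai's embedding-size heuristic as I understand it;
# verify against the library source before relying on it.
def emb_size(n_categories: int) -> int:
    return min(600, round(1.6 * n_categories ** 0.56))

for n in (10, 1000, 100_000):
    # one-hot would need n columns; the embedding is far narrower
    print(f"{n} levels -> one-hot width {n}, embedding width {emb_size(n)}")
```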
