I have a simple question about how to handle boolean values with TabularPandas.
Should they be treated as categorical variables?
If I do this they get encoded, but there are only 2 embeddings, so this is functionally the same as a boolean, right?
I would turn booleans into float Tensors of 0.0 and 1.0. But maybe using embeddings is smarter. Have you tried to run
cont_cat_split on your data? Are the booleans part of conts or cats?
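Here is a rough sketch of the float conversion I mean, using plain pandas. The DataFrame and its column names (`is_member`, `age`) are made up for illustration. After the conversion, the column is an ordinary float column that you could list in `cont_names`.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for this example.
df = pd.DataFrame({
    "is_member": [True, False, True, True],  # boolean feature
    "age": [31.0, 45.0, 27.0, 52.0],         # continuous feature
})

# Find every boolean column and convert it to floats (True -> 1.0, False -> 0.0).
bool_cols = df.select_dtypes(include="bool").columns.tolist()
for col in bool_cols:
    df[col] = df[col].astype("float32")

print(bool_cols)
print(df.dtypes)
```

Whether this beats treating the column as categorical is exactly the open question, so take it as one option to compare.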
A little bit of my reasoning:
Without fastai and its convenient features, we would have to encode every non-numeric feature ourselves. But how? If a column has the values A, B, C, should we map A->1, B->2, C->3 (ordinal encoding) or pick some other assignment? One-hot encoding would instead create three columns A, B, C, each a boolean indicating whether that value appeared in the original column. But for features with many unique values, that produces a lot of new columns that are 0 most of the time.
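To make the two encodings concrete, here is a small pure-Python sketch. The example values and variable names are my own, and the alphabetical ordering behind A->1, B->2, C->3 is just one arbitrary choice.

```python
# Hypothetical column with three distinct values.
values = ["A", "C", "B", "A"]

# Fix an order for the categories (alphabetical here, which is arbitrary).
categories = sorted(set(values))

# Ordinal encoding: one integer per category, A->1, B->2, C->3.
ordinal_map = {cat: i + 1 for i, cat in enumerate(categories)}
ordinal = [ordinal_map[v] for v in values]

# One-hot encoding: one 0/1 column per category.
one_hot = [[int(v == cat) for cat in categories] for v in values]

print(ordinal)
print(one_hot)
```

The one-hot rows get one entry wider for every additional category, which is where the blow-up for high-cardinality features comes from.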
As I understand it, embeddings solve both problems:
- they learn their values from the data (you don’t have to be good at guessing, as you do with ordinal encoding)
- the number of features doesn’t grow as much as it would with one-hot encoding
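A toy sketch of the size argument, in pure Python. This is only a stand-in for a real embedding layer: the numbers (`n_categories`, `emb_dim`) are assumptions I picked, and the table is filled with random values here, whereas in a real model its rows would be trainable parameters updated during training.

```python
import random

random.seed(0)

n_categories = 1000  # assumed cardinality of a categorical feature
emb_dim = 8          # assumed embedding width

# Lookup table with one row of emb_dim floats per category.
# Random here; a real model would learn these values from the data.
table = [[random.gauss(0.0, 1.0) for _ in range(emb_dim)]
         for _ in range(n_categories)]

def embed(idx):
    """Return the embedding row for category index idx."""
    return table[idx]

def one_hot(idx):
    """Return the one-hot vector for category index idx."""
    vec = [0.0] * n_categories
    vec[idx] = 1.0
    return vec

print(len(embed(42)), len(one_hot(42)))
```

For a boolean column there are only two categories, so the size advantage mostly disappears, and what remains is whether a learned vector helps more than a single 0.0/1.0 input.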
But I don’t know whether embeddings can get more out of boolean columns. I guess this is something we should experiment with, and probably on every new dataset.