TabularPandas: how to treat booleans?

TimK · August 18, 2021, 1:34pm

Hi,
I have a simple question about how to handle boolean values with TabularPandas.
Should they be treated as categorical variables?
If I do this they get encoded, but there are only 2 embeddings, so this is functionally the same as a boolean, right?
Thanks,
Tim

JackByte · August 19, 2021, 8:48pm

Hi @TimK,

I would turn booleans into float Tensors of 0.0 and 1.0. But maybe using embeddings is smarter. Have you tried running cont_cat_split on your data? Do the booleans end up in conts or cats?
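To make the two options concrete, here is a minimal pandas-only sketch. The DataFrame and its column names (`is_member`, `age`) are made up for illustration, and nothing here calls fastai itself; it only shows how a boolean column could be prepared either as a 0.0/1.0 float or as a categorical.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for this example.
df = pd.DataFrame({
    "is_member": [True, False, True, False],
    "age": [34.0, 51.0, 27.0, 45.0],
})

# Option 1: treat the boolean as a continuous feature of 0.0 / 1.0.
df["is_member_float"] = df["is_member"].astype("float32")

# Option 2: keep it as a categorical column, so it could be listed
# among the categorical variables and receive an embedding.
df["is_member_cat"] = df["is_member"].astype("category")

print(df.dtypes)
```

Which of the two works better is an empirical question, so it is probably worth trying both on the same train/validation split.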

A little bit of my reasoning:
Without fastai and its convenient features, we would have to encode every non-numeric feature ourselves. But how? Given the values A, B, C, should we map A->1, B->2, C->3 (ordinal encoding) or choose some other mapping? One-hot encoding would create three columns A, B, C, each a boolean indicating whether that value appeared in the original column. But for features with many unique values, that produces a lot of new columns that are 0 most of the time.
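The two encodings described above can be sketched with pandas. The hand-written mapping and the toy series are assumptions for illustration only.

```python
import pandas as pd

# Toy categorical column with the values A, B, C from the example above.
s = pd.Series(["A", "B", "C", "A"], name="letter")

# Ordinal encoding: an arbitrary mapping chosen by hand.
ordinal = s.map({"A": 1, "B": 2, "C": 3})

# One-hot encoding: one indicator column per unique value.
one_hot = pd.get_dummies(s, dtype=int)

print(ordinal.tolist())
print(one_hot)
```

With three unique values the one-hot frame has three columns; a column with thousands of unique values would get thousands, which is the growth problem mentioned above.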

In my understanding, embeddings address both problems:

- they learn their values from the data (you don’t have to be good at guessing, as with ordinal encoding)
- the number of features doesn’t grow as much as it would with one-hot encoding
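As a rough illustration of what an embedding does, here is a NumPy-only sketch (not fastai's or PyTorch's implementation). The table size and the embedding dimension of 2 are assumptions chosen for the example, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories = 3   # e.g. A, B, C
emb_dim = 2        # chosen for illustration only

# The embedding table: one row of weights per category.
# Here it is randomly initialised; training would update these values.
emb = rng.normal(size=(n_categories, emb_dim))

codes = np.array([0, 1, 2, 0])  # integer codes for A, B, C, A

# An embedding lookup is plain row indexing ...
looked_up = emb[codes]

# ... and is equivalent to multiplying a one-hot matrix by the table.
one_hot = np.eye(n_categories)[codes]
via_matmul = one_hot @ emb

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

So an embedding behaves like a one-hot encoding followed by a learned linear layer, but the model only ever sees `emb_dim` columns per feature instead of one column per unique value.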

But I don’t know whether embeddings can get more out of boolean columns. I guess this is something we should experiment with, probably on every new dataset.

Cheers