I thought I’d add an example. Let’s say we have two variables, sex (male/female) and age, and we want to predict whether a person survived or not based on these two variables.
The model could be:
prediction = weight_male * male + weight_age * age + bias
Let’s say the older someone is, the more likely they were to survive the Titanic (not necessarily true but this is only a simple model). Let’s also say women had a higher chance to survive than men (“women and children first!”).
Since high age increases the likelihood of survival, weight_age is some positive number.
Since being male decreases the chance of survival, weight_male would be some negative number.
So maybe the model learns something like this:
prediction = -10 * male + 2 * age
where male is encoded as 1 (and female as 0) and age is in years. (I left off the bias.)
If someone is 30 years old and male, the score would be -10 * 1 + 2 * 30 = -10 + 60 = 50. If someone is 30 years old and female, the score would be 0 + 60 = 60. So in effect, there is a -10 penalty for being male in this model.
(Of course, to get a survival yes/no prediction, we need to turn this number into a probability, maybe using a sigmoid function. But that’s not important right now.)
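To make this concrete, here’s a minimal sketch of that first model in Python (the weights are the made-up ones from above, and the function names `score` and `sigmoid` are just for illustration):

```python
import math

def sigmoid(x):
    # squash a raw score into a probability between 0 and 1
    return 1 / (1 + math.exp(-x))

def score(male, age):
    # male encoded as 1, female as 0; age in years
    # made-up "learned" weights from the example above (no bias term)
    return -10 * male + 2 * age

print(score(1, 30))  # 30-year-old male   -> 50
print(score(0, 30))  # 30-year-old female -> 60
```

With scores this large, the sigmoid saturates to essentially 1 for both, which is another reminder that these weights are only for illustration, not a realistically trained model.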
What if female were encoded as 1 and male as 0? The model might now be:
prediction = 10 * female + 2 * age - 10
This time there is a bias of -10 that penalizes males. Again, a 30-year-old male would score 0 + 60 - 10 = 50, and a 30-year-old female would score 10 + 60 - 10 = 60.
So it doesn’t really matter whether we encode male or female as 1, since the model can absorb the difference into its weights and bias either way.
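A quick sketch to check that the two encodings really are equivalent (again using the made-up weights from above; the function names are hypothetical):

```python
def score_male_encoding(male, age):
    # first encoding: male = 1, female = 0, no bias
    return -10 * male + 2 * age

def score_female_encoding(female, age):
    # second encoding: female = 1, male = 0; the -10 bias absorbs the shift
    return 10 * female + 2 * age - 10

# the same person gets the same score under both encodings
for age in (20, 30, 50):
    assert score_male_encoding(1, age) == score_female_encoding(0, age)  # male
    assert score_male_encoding(0, age) == score_female_encoding(1, age)  # female
```

The only difference between the two models is where the male/female gap lives: in the first it sits entirely in the weight, in the second it is split between the weight and the bias.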