To Label Encode or One Hot Encode?

machinethink · October 8, 2017, 7:01pm

It depends on the machine learning algorithm you’re using. For a decision tree, it’s OK to encode categories using ordinal values (0, 1, 2, 3, etc). For an algorithm that learns a weight for each variable it’s not OK.

Let’s say we have a category animal with three possible types: cow, goat, and pig. If we were to encode this as:

cow  0
goat 1
pig  2

then a decision tree could write rules such as:

if animal == 0 then
  do cow stuff
else if animal == 1 then
  do goat stuff
else if animal == 2 then
  do pig stuff

So there’s no problem there.

However, let’s say we have a logistic regression classifier or a neural network. Now the algorithm learns something like this:

prediction = weight * animal + ... + bias

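In this case, assuming the weight is positive, the predicted value for a pig will be higher than for a goat, and higher still than for a cow (with a negative weight the order simply flips). The same weight is used for three different things, so the model is forced to place goat exactly halfway between cow and pig, an ordering that comes from the arbitrary codes and not from the data. So here it’s not a good idea to use ordinal values to encode the categories.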
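To make this concrete, here is a minimal sketch with a single weight applied to the ordinal codes. The values of `weight` and `bias` are made up for illustration; a real model would learn them and would have other features too.

```python
# Hypothetical parameters, chosen only for illustration.
weight = 2.0
bias = 1.0

# The ordinal codes from the example above.
ordinal = {"cow": 0, "goat": 1, "pig": 2}

def predict(animal):
    # One shared weight is multiplied by the category's arbitrary code.
    return weight * ordinal[animal] + bias

predictions = {name: predict(name) for name in ordinal}
for name, value in predictions.items():
    print(name, value)
```

Whatever value the weight takes, the step from cow to goat is the same as the step from goat to pig. The model cannot give the goat a prediction that falls outside that straight line.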

Instead, we want to use an encoding where the distance between cow, goat, and pig is equal:

cow  [1, 0, 0]
goat [0, 1, 0]
pig  [0, 0, 1]

This is one-hot encoding. Note that if you treat each of these as a vector, the distance between each pair of animals is always sqrt(2) (for Euclidean or L2-distance), or 2 (for L1-distance).
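If you want to check the distances yourself, a small sketch like the following computes both metrics for every pair. The helper names `l2_distance` and `l1_distance` are just ones I made up for this example.

```python
import math
from itertools import combinations

one_hot = {
    "cow":  [1, 0, 0],
    "goat": [0, 1, 0],
    "pig":  [0, 0, 1],
}

def l2_distance(a, b):
    # Euclidean distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1_distance(a, b):
    # L1 (Manhattan) distance: sum of the absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

# Map each pair of animals to its (L2, L1) distances.
distances = {
    (p, q): (l2_distance(one_hot[p], one_hot[q]),
             l1_distance(one_hot[p], one_hot[q]))
    for p, q in combinations(one_hot, 2)
}

for pair, (l2, l1) in distances.items():
    print(pair, round(l2, 4), l1)
```

Every pair comes out with the same L2 value and the same L1 value, which is the "equal distance" property we are after.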

What the ML algorithm learns is now:

prediction = weight_cow * cow + weight_goat * goat + weight_pig * pig + ... + bias

Since only one of these at a time (cow, goat, or pig) can ever be 1, only one weight gets used and we can learn a weight for each individual type of animal.
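Here is a rough sketch of that selection effect. The weights and bias are invented numbers, and the `...` part of the formula (the other features) is left out to keep it short.

```python
categories = ["cow", "goat", "pig"]

# Hypothetical learned parameters, one weight per animal.
weights = {"cow": 0.25, "goat": -1.5, "pig": 3.0}
bias = 0.5

def one_hot(animal):
    # 1 in the position of the matching category, 0 everywhere else.
    return [1 if animal == c else 0 for c in categories]

def predict(animal):
    x = one_hot(animal)
    # Only the term whose input is 1 contributes; the others are multiplied by 0.
    return sum(weights[c] * xi for c, xi in zip(categories, x)) + bias

for animal in categories:
    print(animal, one_hot(animal), predict(animal))
```

Each prediction is simply that animal's own weight plus the bias, so the three animals can be pushed in completely independent directions.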

As I mentioned, we can actually leave out one of these categories:

cow  [1, 0]
goat [0, 1]
pig  [0, 0]

The absence of cow and goat implies the thing is a pig. The distance between cow and goat is still sqrt(2) (or 2 if you’re using L1-distance), but between cow and pig, and between goat and pig, it is 1 in either metric. So the distances are no longer all equal, but they are close enough. Plus it probably won’t matter if you look at what the ML algorithm now learns:

prediction = weight_cow * cow + weight_goat * goat + ... + bias

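Here, the pig does not have its own weight. Is that a problem? I don’t think so, because the bias term takes over that role. For a pig both inputs are 0, so the prediction is just the bias (plus whatever the other features contribute), and weight_cow and weight_goat end up being learned as offsets relative to the pig.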
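One way to convince yourself about the bias is to take a full one-hot model and fold the pig's weight into the bias. This is only a sketch for a purely linear model with made-up numbers, not a claim about how any particular library handles it.

```python
# Hypothetical parameters of a full one-hot model.
full_weights = {"cow": 0.25, "goat": -1.5, "pig": 3.0}
full_bias = 0.5

# Fold the pig's weight into the bias and express the others relative to it.
reduced_bias = full_bias + full_weights["pig"]
reduced_weights = {
    "cow": full_weights["cow"] - full_weights["pig"],
    "goat": full_weights["goat"] - full_weights["pig"],
}

def predict_full(animal):
    # Full one-hot: exactly one weight is active.
    return full_weights[animal] + full_bias

def predict_reduced(animal):
    # Dropped-category encoding: pig is [0, 0], so no weight is active for it.
    return reduced_weights.get(animal, 0.0) + reduced_bias

for animal in ["cow", "goat", "pig"]:
    print(animal, predict_full(animal), predict_reduced(animal))
```

Both models give the same prediction for all three animals, so nothing is lost by dropping the pig column in this linear setting.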

Anyway… your question was about male vs female. You could encode it as:

male   [1, 0]
female [0, 1]

That would certainly work. The ML algorithm learns:

prediction = weight_male * male + weight_female * female + ... + bias

But let’s say you’re encoding it as male = 1, female = 0, then what the ML algorithm learns is this:

prediction = weight_male * male + ... + bias

This is fine. For a female the weight_male term vanishes, so the prediction rests on the bias (plus the other features); for a male, weight_male is added on top of that. A large positive or negative weight means being male shifts the prediction a lot, and a weight near zero means it hardly matters.

Of course, in a real classifier the formula for the prediction is more complicated (it probably won’t make decisions based on just male/female but the combination of male/female with other features), but the point is that with just two categories, 1 and 0 are enough for the classifier to make a useful distinction. You could also use 1 and -1, or 100 and 0, or 100 and -100, as long as the two values are different.
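To illustrate why the exact pair of values doesn't matter for a linear term, here is a small sketch that solves for the weight and bias needed to hit the same two predictions under each encoding. The target predictions (3.0 for male, 1.0 for female) and the function name `fit_weight_and_bias` are made up for this example.

```python
def fit_weight_and_bias(male_value, female_value, male_target, female_target):
    # Solve weight * value + bias = target for the two categories.
    if male_value == female_value:
        raise ValueError("the two encoded values must be different")
    weight = (male_target - female_target) / (male_value - female_value)
    bias = female_target - weight * female_value
    return weight, bias

# The (male, female) encodings mentioned above.
encodings = [(1, 0), (1, -1), (100, 0), (100, -100)]

results = []
for male_value, female_value in encodings:
    weight, bias = fit_weight_and_bias(male_value, female_value, 3.0, 1.0)
    male_pred = weight * male_value + bias
    female_pred = weight * female_value + bias
    results.append((male_pred, female_pred))
    print((male_value, female_value), "weight:", weight, "bias:", bias)
```

Each encoding just leads to a different weight and bias that produce the same pair of predictions. If the two values were equal, there would be nothing to solve for, which is why they have to differ.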

I hope this makes sense.