To Label Encode or One Hot Encode?

When is it appropriate to use label encoding vs. one hot encoding and vice-versa?

What got me thinking about this was working through the Kaggle Titanic dataset and noticing how most folks handle the “Sex” column, which has no missing values and is either “Male” or “Female”. Almost everyone simply uses label encoding for this feature, and the encoding is always Male=1, Female=0.

BUT, my understanding of label encoding is that it only makes sense when the categories have “a natural ordered relationship between each other” (see here). If there isn’t such a relationship, then label encoding can mislead your model into thinking that one value is more important simply because it is encoded with a higher integer.

As such, since females represent about 2/3 of the Titanic survivors, wouldn’t it be better to use Female=1 and Male=0 encoding? Or, since there isn’t a hierarchical relationship between the two values, wouldn’t it in fact be better to simply use one-hot encoding?


When encoding male/female as 1/0 (or 0/1) you’re basically doing one-hot encoding except you’re using just one value instead of two.

In statistics, when using one-hot encoding, it is common to leave out one of the columns because it can be inferred as being the thing that is absent (i.e. if all columns in the one-hot encoded vector are 0, then it must be the “other” thing).


But it isn’t really one-hot encoding because for one of the genders, the value is always 0.

I understand that in stats the column can be inferred, but my question is more about how different classifiers or models may interpret or misinterpret the value. From what I’ve read, it seems like they may infer that one value is of greater importance because it is encoded with a higher integer.

It depends on the machine learning algorithm you’re using. For a decision tree, it’s OK to encode categories using ordinal values (0, 1, 2, 3, etc.). For an algorithm that learns a weight for each variable, it’s not.

Let’s say we have a category animal with three possible types: cow, goat, and pig. If we were to encode this as:

cow  0
goat 1
pig  2

then a decision tree could write rules such as:

if animal == 0 then
  do cow stuff
else if animal == 1 then
  do goat stuff
else if animal == 2 then
  do pig stuff

So there’s no problem there.
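To make that concrete, here’s a minimal scikit-learn sketch with made-up toy data (the animal labels and target values are just for illustration):

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data: an animal category and an arbitrary target.
animals = np.array(["cow", "goat", "pig", "cow", "pig", "goat"])
target = np.array([0, 1, 1, 0, 1, 1])

# Integer-encode the category; LabelEncoder sorts alphabetically,
# so cow=0, goat=1, pig=2.
codes = LabelEncoder().fit_transform(animals).reshape(-1, 1)

# The tree splits on thresholds over these codes (e.g. animal <= 0.5),
# so the arbitrary ordering doesn't hurt it.
tree = DecisionTreeClassifier(random_state=0).fit(codes, target)
print(tree.predict([[0], [2]]))  # cow -> 0, pig -> 1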

However, let’s say we have a logistic regression classifier or a neural network. Now the algorithm learns something like this:

prediction = weight * animal + ... + bias

In this case, if the animal is a pig, the predicted value will be higher than if it is a goat, and much higher than if it is a cow (assuming the weight is positive). The same weight is used for three different things. So here it’s not a good idea to use ordinal values to encode the categories.

Instead, we want to use an encoding where the distance between cow, goat, and pig is equal:

cow  [1, 0, 0]
goat [0, 1, 0]
pig  [0, 0, 1]

This is one-hot encoding. Note that if you treat each of these as a vector, the distance between each pair of animals is always sqrt(2) (for Euclidean or L2 distance), or 2 (for L1 distance).
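A quick numpy sanity check of that equal-distance claim:

import numpy as np
from itertools import combinations

one_hot = {
    "cow":  np.array([1, 0, 0]),
    "goat": np.array([0, 1, 0]),
    "pig":  np.array([0, 0, 1]),
}

# Every pair of animals is the same distance apart.
for a, b in combinations(one_hot, 2):
    diff = one_hot[a] - one_hot[b]
    print(a, b, "L2:", np.linalg.norm(diff), "L1:", np.abs(diff).sum())
# Each pair prints L2 = 1.414... (sqrt(2)) and L1 = 2.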

What the ML algorithm learns is now:

prediction = weight_cow * cow + weight_goat * goat + weight_pig * pig + ... + bias

Since only one of these at a time (cow, goat, or pig) can ever be 1, only one weight gets used and we can learn a weight for each individual type of animal.
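Seen as a dot product, the one-hot vector simply selects a single weight (the weight values below are made up for illustration):

import numpy as np

weights = np.array([0.5, -1.2, 2.0])  # weight_cow, weight_goat, weight_pig (made up)
goat = np.array([0, 1, 0])            # one-hot vector for goat

# The dot product zeroes out every weight except weight_goat.
print(weights @ goat)  # -1.2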

As I mentioned, we can actually leave out one of these categories:

cow  [1, 0]
goat [0, 1]
pig  [0, 0]

The absence of cow and goat implies the thing is a pig. The distance between cow and goat is still sqrt(2) (or 2 if you’re using L1 distance), but between cow and pig it is only 1 (with either metric). So the distances are no longer all equal, but they’re close enough. Plus it probably won’t matter if you look at what the ML algorithm now learns:

prediction = weight_cow * cow + weight_goat * goat + ... + bias

Here, the pig does not have its own weight. Is that a problem? Not really: the bias term takes over that job. It acts as the baseline prediction for pig, and weight_cow and weight_goat are learned as offsets relative to that baseline.
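Incidentally, this “leave one category out” encoding is a single flag in pandas. Note that pandas drops the first category alphabetically, so in this sketch cow, rather than pig, becomes the all-zeros baseline:

import pandas as pd

animals = pd.Series(["cow", "goat", "pig", "cow"], name="animal")

# drop_first=True omits the first category; a row of all zeros
# then means "cow", and the bias absorbs the cow baseline.
print(pd.get_dummies(animals, drop_first=True, dtype=int))
#    goat  pig
# 0     0    0   <- cow, the all-zeros baseline
# 1     1    0
# 2     0    1
# 3     0    0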

Anyway… your question was about male vs female. You could encode it as:

male   [1, 0]
female [0, 1]

That would certainly work. The ML algorithm learns:

prediction = weight_male * male + weight_female * female + ... + bias

But let’s say you encode it as male = 1, female = 0; then what the ML algorithm learns is this:

prediction = weight_male * male + ... + bias

This is fine, since the model can assign a large (positive or negative) weight for when being male is important to the prediction; for females the male term drops out, and the bias (together with the other features) picks up the slack.

Of course, in a real classifier the formula for the prediction is more complicated (it probably won’t make decisions based on just male/female but the combination of male/female with other features), but the point is that with just two categories, 1 and 0 are enough for the classifier to make a useful distinction. You could also use 1 and -1, or 100 and 0, or 100 and -100, as long as the two values are different.

I hope this makes sense. :smiley:


I thought I’d add an example. Let’s say we have two variables: male/female and age, and we want to predict if a person survives or not based on these two variables.

The model could be:

prediction = weight_male * male + weight_age * age + bias

Let’s say the older someone is, the more likely they were to survive the Titanic (not necessarily true but this is only a simple model). Let’s also say women had a higher chance to survive than men (“women and children first!”).

Since high age increases the likelihood of survival, weight_age is some positive number.

Since being male decreases the chance of survival, weight_male would be some negative number.

So maybe the model learns something like this:

prediction = -10 * male + 2 * age

where age is just the person’s age in years. (I left off the bias.)

If someone is 30 years old and male, the score would be -10 + 60 = 50. If someone is 30 years old and female, the score would be 0 + 60 = 60. So in effect, there is a -10 penalty for being male in this model.

(Of course, to get a survival yes/no prediction, we need to turn this number into a probability, maybe using a sigmoid function. But that’s not important right now.)

What if female was encoded as 1 and male as 0? The model might now be:

prediction = 10 * female + 2 * age - 10

This time there is a bias (of -10), to penalize males. Again, a 30-year-old male would score 0 + 60 - 10 = 50, and a 30-year-old female would score 10 + 60 - 10 = 60.

So it doesn’t really matter whether we encoded male or female as 1 or 0, since the model can learn to deal with it either way.
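A tiny check of that equivalence in code, using the same made-up numbers:

def score_male_coded(male, age):
    # male = 1, female = 0
    return -10 * male + 2 * age

def score_female_coded(female, age):
    # female = 1, male = 0; the -10 bias penalizes males instead
    return 10 * female + 2 * age - 10

# A 30-year-old male scores 50 and a 30-year-old female scores 60,
# no matter which of the two encodings we use.
print(score_male_coded(1, 30), score_female_coded(0, 30))  # 50 50
print(score_male_coded(0, 30), score_female_coded(1, 30))  # 60 60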


Thanks for the thorough replies @machinethink!

It seems like the safer approach is to default to one-hot encoding unless there is some natural order that label encoding captures better.

I’m still playing with the Titanic dataset and will try both approaches to see what, if anything, changes significantly in the predictions.


Please report back with your findings! This topic you opened is quite interesting :slight_smile:

Perhaps it would make more sense to use the values -1 and 1, like the Heaviside activation in a single perceptron typically outputs; then you’d have the same gradient in opposite directions. Having said that, I have tried a perceptron with 0 and 1 and it still works. I just suspect the learning isn’t as good in one direction, since the loss isn’t symmetric.

@machinethink Thanks for the beautiful explanation.

Can we use one-hot encoding for categorical features when a data point belongs to more than one type of a particular category? I assume we should be able to, since we have different weights for each type.
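For example, scikit-learn’s MultiLabelBinarizer produces exactly this kind of “multi-hot” vector (a quick sketch to illustrate the idea; the animal labels are made up):

from sklearn.preprocessing import MultiLabelBinarizer

# A data point can belong to several categories at once,
# so more than one position in the vector can be 1.
mlb = MultiLabelBinarizer()
print(mlb.fit_transform([["cow", "goat"], ["pig"], ["cow"]]))
# [[1 1 0]
#  [0 0 1]
#  [1 0 0]]
print(mlb.classes_)  # ['cow' 'goat' 'pig']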