It depends on the machine learning algorithm you’re using. For a decision tree, it’s OK to encode categories using ordinal values (0, 1, 2, 3, etc). For an algorithm that learns a weight for each variable it’s not OK.
Let’s say we have a category animal with three possible types: cow, goat, and pig. If we were to encode this as:
cow = 0
goat = 1
pig = 2
then a decision tree could write rules such as:
if animal == 0 then
do cow stuff
else if animal == 1 then
do goat stuff
else if animal == 2 then
do pig stuff
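The tree’s rules above can be sketched in plain Python (the animal names and ordinal codes are just the example values from this answer, not anything a real library mandates):

```python
# Hypothetical ordinal encoding of the animal category
animal_codes = {"cow": 0, "goat": 1, "pig": 2}

def handle(animal):
    """A decision-tree-style rule on the ordinal code."""
    code = animal_codes[animal]
    if code == 0:
        return "cow stuff"
    elif code == 1:
        return "goat stuff"
    else:  # code == 2
        return "pig stuff"
```

The tree only ever asks "is the code equal to (or less than) some value?", so the arbitrary ordering of the codes does no harm.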
So there’s no problem there.
However, let’s say we have a logistic regression classifier or a neural network. Now the algorithm learns something like this:
prediction = weight * animal + ... + bias
In this case, if the animal is a pig, the predicted value will be higher than if it is a goat, and much higher than if it were a cow. The same weight is used for three different things. So here it’s not a good idea to use ordinal values to encode the categories.
Instead, we want to use an encoding where the distance between cow, goat, and pig is equal:
cow [1, 0, 0]
goat [0, 1, 0]
pig [0, 0, 1]
This is one-hot encoding. Note that if you treat each of these as a vector, the distance between each pair of animals is always
sqrt(2) (for Euclidean or L2 distance), or
2 (for L1 or Manhattan distance).
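You can verify those distances directly with a quick standard-library sketch:

```python
import math

# One-hot vectors for the three animals
cow  = [1, 0, 0]
goat = [0, 1, 0]
pig  = [0, 0, 1]

def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1(a, b):
    """Manhattan (L1) distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(l2(cow, goat))  # sqrt(2) ~ 1.414, same for every pair
print(l1(cow, pig))   # 2, same for every pair
```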
What the ML algorithm learns is now:
prediction = weight_cow * cow + weight_goat * goat + weight_pig * pig + ... + bias
Since only one of these features can be 1 at a time (for a pig, only the pig element is 1), only one of those weights is active per example, and we can learn a separate weight for each individual type of animal.
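That weighted sum is just a dot product in which the one-hot vector picks out exactly one weight. A sketch with made-up weights and bias (not learned values):

```python
# Made-up parameters, purely for illustration
weights = {"cow": 0.5, "goat": -1.2, "pig": 2.0}
bias = 0.1

def prediction(one_hot):
    """one_hot is e.g. {"cow": 0, "goat": 1, "pig": 0}."""
    return sum(weights[k] * one_hot[k] for k in weights) + bias

# A goat: only weight_goat contributes, giving -1.2 + 0.1
print(prediction({"cow": 0, "goat": 1, "pig": 0}))
```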
As I mentioned, we can actually leave out one of these categories:
cow [1, 0]
goat [0, 1]
pig [0, 0]
The absence of cow and goat implies the animal is a pig. The distance between cow and goat is still sqrt(2) (or 2 if you’re using L1 distance), but between cow and pig it is now 1 (for both L1 and L2). The square root of 2 is a bit larger than 1, but close enough. Plus it probably won’t matter if you look at what the ML algorithm now learns:
prediction = weight_cow * cow + weight_goat * goat + ... + bias
Here, the pig does not have its own weight. Is that a problem? Not really: pig effectively becomes the reference category, and the bias term absorbs its contribution, so the weights for cow and goat are learned relative to pig.
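With this leave-one-out (dummy) encoding, the same kind of sketch shows the pig falling back to the bias alone (again with made-up numbers, not learned values):

```python
# Made-up parameters; pig has no weight of its own
weights = {"cow": 0.5, "goat": -1.2}
bias = 0.1

def prediction(features):
    """features is e.g. {"cow": 1, "goat": 0}; a pig is all zeros."""
    return sum(weights[k] * features[k] for k in weights) + bias

# For a pig both indicators are 0, so the prediction is just the bias
print(prediction({"cow": 0, "goat": 0}))
```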
Anyway… your question was about male vs female. You could encode it as:
male [1, 0]
female [0, 1]
That would certainly work. The ML algorithm learns:
prediction = weight_male * male + weight_female * female + ... + bias
But let’s say you’re encoding it as male = 1, female = 0. Then what the ML algorithm learns is this:
prediction = weight_male * male + ... + bias
This is fine: the classifier can assign a large (positive or negative) weight for when being male is important to the prediction, while the bias term handles the female case, since the male term is then 0.
Of course, in a real classifier the formula for the prediction is more complicated (it probably won’t make decisions based on just male/female but the combination of male/female with other features), but the point is that with just two categories, 1 and 0 are enough for the classifier to make a useful distinction. You could also use 1 and -1, or 100 and 0, or 100 and -100, as long as the two values are different.
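To make that last point concrete: with a single binary feature, a weight and a bias can produce any two target outputs, whichever pair of values you pick for the encoding. A sketch with made-up target outputs of 2.0 for one category and -3.0 for the other:

```python
def prediction(x, weight, bias):
    """A one-feature linear model."""
    return weight * x + bias

# Encoding male=1, female=0: weight=5.0, bias=-3.0 hits both targets.
print(prediction(1, 5.0, -3.0), prediction(0, 5.0, -3.0))

# Encoding male=1, female=-1: weight=2.5, bias=-0.5 hits the same targets.
print(prediction(1, 2.5, -0.5), prediction(-1, 2.5, -0.5))
```

Different encodings just shift which (weight, bias) pair the training procedure ends up at; the set of reachable predictions is the same.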
I hope this makes sense.