Titanic - Why consider passenger class (Pclass = 1, 2 or 3) as a categorical variable and not a number (like SibSp or Parch or age)?

In this notebook, Jeremy writes:

We’ll create dummy variables for `Pclass`, even although it’s numeric, since the numbers `1`, `2`, and `3` correspond to first, second, and third class cabins - not to counts or measures that make sense to multiply by.

He also explains it in the video lesson.

When dealing with `Embarked`, which is C, Q or S, Jeremy says that coding C,Q,S to 0,1,2 or 2,0,1 or 1,0,2 or any of the 6 possible combinations won’t make a difference, because there is no order between C, Q and S. And it doesn’t make sense to add or multiply those values. It’s just a way to convert it to numbers. I agree with that, it makes sense.

But when dealing with `Pclass`, I don’t get it.
Passenger class is an ordered concept from 1 to 3, just like Parch (number of parents or children aboard) is an ordered concept from 0 to 6.

Why would it make sense to add/multiply things like Parch, but not Pclass? Someone with 2 parents/children onboard has more parents/children onboard than someone with Parch 1 and fewer than someone with Parch 3. There is definitely an order, which is why Jeremy left it as a numerical variable.

But in the same way, someone in class 2 has a lower class ticket than someone in class 1 and a higher class ticket than someone in class 3. That is definitely also an ordered concept, so why not use it as-is, as a number?

Can someone explain what I’m missing?

It’s just a modeling decision about how to encode your variables, either with integer encoding or one hot encoding.

That means you can make a different decision if you feel that it should be treated numerically. Ideally, you would run an experiment and see which model works better.

Normally, when things are in order and have some kind of “multiplicative / additive nature”, we use an integer encoding, and when categories are nominal (something like color or city) we use one hot encoding. But as you rightly point out, sometimes there is a case for both integer encoding and one hot encoding.

When your input is a string, yes I agree.

But when the input is a number, there are 3 possible ways to handle it:

• keep the number as-is (numeric), that’s what Jeremy does with SibSp and Parch
• make it a category with integer encoding
• make it a category with one-hot encoding
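To make the three options above concrete, here is a minimal pandas sketch. The data is a made-up toy column standing in for `Pclass`, not the real Titanic file, and the variable names are only for illustration.

```python
import pandas as pd

# Toy values standing in for the Titanic `Pclass` column (made up for illustration).
df = pd.DataFrame({"Pclass": [3, 1, 2, 3, 1]})

# 1. Keep the number as-is (numeric).
numeric = df["Pclass"]

# 2. Category with integer encoding: each distinct value gets an integer code.
codes = pd.Categorical(df["Pclass"]).codes

# 3. Category with one-hot encoding: one indicator column per class.
one_hot = pd.get_dummies(df["Pclass"], prefix="Pclass")

print(numeric.tolist())
print(codes.tolist())
print(one_hot.columns.tolist())
```

With options 1 and 2 the model sees a single column; with option 3 it sees three columns and can weight each class independently.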

My question was not really “integer encoding vs one hot encoding”, it was “numeric vs categorical”.

I mean, why would Parch (integer range 0 to 6 with a logical order) be considered a continuous/numeric variable but not Pclass (integer range 1 to 3 with a logical order)?

Again, I’d say it’s a modeling decision 🤷‍♂️. You either keep things numeric or, if you want more flexibility and don’t want to assume a linear effect of your input, you one hot encode it. (I guess neural networks can also learn nonlinear effects of integer encoded inputs, similar to what trees can do; it will just take more “layers” to build this mapping.)

Also, integer encoding and “keeping things numeric” are exactly the same as far as a neural network is concerned.
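Here is a small sketch of what “assuming a linear effect” means in practice, using a plain least-squares fit rather than a neural network. The per-class rates are made-up numbers, chosen only so that the steps between classes are unequal.

```python
import numpy as np

# Made-up outcome rates per class, deliberately non-linear in the class number.
pclass = np.array([1, 2, 3])
rate = np.array([0.6, 0.5, 0.2])

# Numeric encoding: intercept + one slope, so every step between
# adjacent classes is forced to have the same effect.
X_num = np.column_stack([np.ones(3), pclass])
coef_num, *_ = np.linalg.lstsq(X_num, rate, rcond=None)
pred_num = X_num @ coef_num

# One-hot encoding: one coefficient per class, so each class gets its own level.
X_hot = np.eye(3)
coef_hot, *_ = np.linalg.lstsq(X_hot, rate, rcond=None)
pred_hot = X_hot @ coef_hot

print(np.round(pred_num, 3))
print(np.round(pred_hot, 3))
```

The one-hot fit reproduces the three rates exactly, while the numeric fit can only place them on a straight line. That extra flexibility is the main thing one hot encoding buys in a linear model.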

`Pclass` is qualitative, not quantitative. A categorical encoding is sufficient, and is commonly used for ordinal categories.

Thanks, interesting link, indeed ordinal data seems to sit somewhere between numerical and categorical.

What makes me want to treat Pclass as numerical is this sentence (from your article):

However, unlike categorical data, the numbers do have mathematical meaning. For example, if you survey 100 people and ask them to rate a restaurant on a scale from 0 to 4, taking the average of the 100 responses will have meaning. This would not be the case with categorical data.

What I mean is, if you create a category by doing something like `df['Pclass'] = pd.Categorical(df.Pclass)`, letting pandas assign an integer to each possible value, and you end up with something that breaks the order (say, for instance, 0 = class 2; 1 = class 3; 2 = class 1), then you lose information: you lose the mathematical relationship between ticket classes that might have been a helpful feature. So on the risk side, you may lose meaning. And even if this risk doesn’t exist, because we assume pandas is smarter than that and will keep the order, on the benefit side there seems to be no benefit at all in transforming it into a category. This is what bothers me about making Pclass a category. I mean, why bother? It’s already numerical and ordered!
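On the worry about pandas scrambling the order: as far as I know, `pd.Categorical` sorts the inferred categories when the values are sortable, so the integer codes follow the class numbers. A quick check on made-up values (not the real dataset):

```python
import pandas as pd

pclass = pd.Series([2, 3, 1, 3, 2])  # made-up values, deliberately unsorted

cat = pd.Categorical(pclass)
print(list(cat.categories))  # categories are inferred in sorted order
print(list(cat.codes))       # codes index into that sorted list
print(cat.ordered)           # but the categorical is not flagged as ordered

# To state the order explicitly, pass the categories and ordered=True.
ordered = pd.Categorical(pclass, categories=[1, 2, 3], ordered=True)
print(ordered.ordered)
```

So the codes keep the 1 < 2 < 3 order here, although by default pandas does not treat the category itself as ordered unless you ask for it.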

Since I wrote this post, I opened the next notebook in the Titanic lesson (this one) and Jeremy wrote this:

Note that we no longer consider `Pclass` a categorical variable. That’s because it’s ordered (i.e. 1st, 2nd, and 3rd class have an order), and decision trees, as we’ll see, only care about order, not about absolute value.
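The “only care about order” point can be checked with a toy single-split search. This is a hand-rolled sketch, not the notebook’s code or any library’s implementation; `best_split` is a hypothetical helper and the data is made up.

```python
def best_split(xs, ys):
    """Return the row indices sent left by the best threshold split (lowest weighted Gini)."""
    def gini(labels):
        if not labels:
            return 0.0
        p = sum(labels) / len(labels)
        return 2 * p * (1 - p)

    best = None
    for t in sorted(set(xs))[:-1]:  # candidate splits of the form x <= t
        left = [i for i, x in enumerate(xs) if x <= t]
        right = [i for i, x in enumerate(xs) if x > t]
        score = (len(left) * gini([ys[i] for i in left])
                 + len(right) * gini([ys[i] for i in right])) / len(xs)
        if best is None or score < best[0]:
            best = (score, frozenset(left))
    return best[1]

# Made-up toy data: class number and a 0/1 outcome.
pclass = [1, 1, 1, 2, 2, 3, 3, 3]
survived = [1, 1, 0, 1, 0, 0, 0, 0]

# Same ordering of rows, very different spacing between the values.
rescaled = [10, 10, 10, 20, 20, 500, 500, 500]

same_partition = best_split(pclass, survived) == best_split(rescaled, survived)
print(same_partition)  # True
```

Because a split only asks which rows fall on each side of a threshold, any relabeling that preserves the order gives the same partition. That is why the spacing between 1, 2, and 3 doesn’t matter to a tree, whereas it does matter to a linear model.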

So I guess my thinking was somewhat right. It’s not very clear to me why he decided it’s categorical at one point and numerical later, but maybe that’s because both are correct approaches.

I wanted to show both so that you see that both are possible, and to encourage you to think about the options - which, it appears, worked very well in this case.

Ahah, I confirm, it worked really well!

BTW as a rule of thumb for random forests, for ordinal data I find that with <=5 categories generally treating them as a category tends to work better, and >5 numerical works better. Of course there are exceptions.
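As a rough sketch, that rule of thumb could be written as a tiny helper. `suggest_encoding` and its `max_levels` parameter are hypothetical names for illustration, and the rule only applies to ordinal columns, with the exceptions noted above.

```python
def suggest_encoding(values, max_levels=5):
    """Hypothetical helper: apply the <=5 levels rule of thumb to an ordinal column."""
    n_levels = len(set(values))
    return "categorical" if n_levels <= max_levels else "numeric"

print(suggest_encoding([1, 2, 3, 3, 1]))        # 3 levels, like Pclass
print(suggest_encoding([0, 1, 2, 3, 4, 5, 6]))  # 7 levels, like Parch
```

It is only a starting point; comparing both encodings on a validation set is still the real test.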

Thank you, that’s a really useful rule of thumb to know.