TabularPandas Categorify with Kaggle Titanic Dataset Issue with PassengerId in to.valid.xs Validation Set

I ran into this issue with the vanilla Kaggle Titanic competition dataset. When I include the Categorify proc and create a TabularPandas object, the categorical variable ‘PassengerId’ is set to 0 in the validation set and shows as #na# in the validation dataloader. When I run the same test without Categorify, PassengerId in the TabularPandas validation set is left as is. Here is a code snippet of what I’ve done:

!kaggle competitions download -c titanic

from fastai.tabular.all import *  # fastai tabular API; also exposes pandas as pd

train_data = pd.read_csv(path/'train.csv')
train_data.head()

#PassengerId shows up normally with a dataframe length of 891

dep_var = 'Survived'
cat_vars = ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Cabin', 'Embarked']
cont_vars = ['Age', 'Fare']
procs = [Categorify, FillMissing, Normalize]
splits = IndexSplitter(list(range(710,891)))(range_of(train_data))
to = TabularPandas(train_data, procs, cat_vars, cont_vars, y_names=dep_var, splits=splits)
to.train.xs.PassengerId, to.valid.xs.PassengerId

#training has a valid PassengerId and validation has all 0’s

(0 1
1 2
2 3
3 4
4 5

705 706
706 707
707 708
708 709
709 710
Name: PassengerId, Length: 710, dtype: int16,
710 0
711 0
712 0
713 0
714 0

886 0
887 0
888 0
889 0
890 0
Name: PassengerId, Length: 181, dtype: int16)

dls = to.dataloaders(bs=64)
dls.valid.show_batch()

#the dataloader created from the TabularPandas has #na# for the validation set PassengerId

|PassengerId|Pclass|Sex|…
|#na#|1|female|
|#na#|1|male|
|#na#|1|male|

I may be misunderstanding what to expect, or I may have made a rookie mistake. I haven’t observed this with other tabular datasets, though. I appreciate any assistance the community might offer.
Thanks,
Mark

Perhaps someone could take a look at the gist of the code where I see the behavior: Titanic-ML

Thanks,
Mark

Off the top of my head: 1) you only get categories it knows about from the training data, hence every PassengerId in the validation set is unknown, and 2) PassengerId is not a category.

Btw I probably should have been clearer but I was on a phone and in a hurry.

ID has no place in your model in any form. It’s an arbitrary number with no predictive power. You should drop() it from the training/validation data, and you should pop() it from the test data and use it only once you’ve got preds from the test data, to create your submission CSV.
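
A minimal sketch of that drop()/pop() workflow in plain pandas, in case it helps. The toy rows and the placeholder `preds` list are made up for illustration and are not taken from the competition files or from any model; only the column names PassengerId and Survived come from the thread above.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are invented).
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 3],
    "Survived": [0, 1, 1],
})
test_df = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 2],
})

# Training/validation data: the ID is simply dropped.
train_df = train_df.drop(columns=["PassengerId"])

# Test data: pop() removes the column and hands it back for later use.
test_ids = test_df.pop("PassengerId")

# Placeholder predictions; in practice these would come from the model.
preds = [0, 1]

# The IDs are only used here, to label the predictions for submission.
submission = pd.DataFrame({"PassengerId": test_ids, "Survived": preds})
csv_text = submission.to_csv(index=False)
print(csv_text)
```

In the original snippet this would also mean leaving 'PassengerId' out of cat_vars before building the TabularPandas object.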

Thanks Joe. I do follow your logic on the predictive power of an ID field which is at best a category with one member. :grinning: I just wondered why that was the only field that categorify didn’t seem to handle well. I’ll take it out as you suggest.

Mark

Likely because your training set never had a 0 value. Thus it’s always #na#

Thanks Zach. I must just not have wrapped my head around how the splits work for train and valid.

splits = IndexSplitter(list(range(710,891)))(range_of(train_data))

I thought this would hold back the records at dataframe positions 710 to 890 as a validation set that looks exactly like the training data (where the PassengerIds happen to be incremental non-zero integers, one per record), but consists of records the model/learner hasn’t seen, so that overfitting shows up?
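
For what it's worth, that reading of the split is right. A rough pure-Python sketch of what an index-based split does is below; `index_split` is a hypothetical helper written for illustration, not fastai's IndexSplitter, and the IDs are assumed to be row position + 1, as in the output shown earlier.

```python
def index_split(n_items, valid_idx):
    """Return (train positions, validation positions) for n_items rows."""
    valid = set(valid_idx)
    train = [i for i in range(n_items) if i not in valid]
    return train, sorted(valid)

# Same numbers as in the thread: 891 rows, positions 710..890 held out.
train_idx, valid_idx = index_split(891, range(710, 891))

# Assumption: PassengerId is just row position + 1.
passenger_ids = [i + 1 for i in range(891)]
train_ids = {passenger_ids[i] for i in train_idx}
valid_ids = {passenger_ids[i] for i in valid_idx}

# The rows are held out as expected, but no validation ID was ever seen in training.
overlap = train_ids & valid_ids
print(len(train_idx), len(valid_idx), len(overlap))  # 710 181 0
```

So the hold-out works as intended; the catch is that a column whose values are unique per row shares no values between the two sets.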

Mark

The issue is that it can’t make sense of anything not in the training data. Imagine you’re predicting car prices and you’ve got a colour cat column, then in your validation data it comes across a neon pink car but it hasn’t seen one in the training data. What’s it meant to do with that?

Categories it doesn’t know about show up as #na#, and all your IDs from the validation set are unknown.
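
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not fastai's Categorify code: `build_vocab`, `encode` and `decode` are hypothetical helpers, and the only assumption carried over from the thread is that code 0 stands for #na#.

```python
NA = "#na#"

def build_vocab(train_values):
    # The vocabulary is built from training values only; slot 0 is reserved for unknowns.
    return [NA] + sorted(set(train_values))

def encode(values, vocab):
    lookup = {value: code for code, value in enumerate(vocab)}
    # Anything missing from the training vocabulary falls back to code 0.
    return [lookup.get(value, 0) for value in values]

def decode(codes, vocab):
    return [vocab[code] for code in codes]

# The car colour example: neon pink never appears in training.
colour_vocab = build_vocab(["red", "blue", "red", "black"])
colour_codes = encode(["blue", "neon pink"], colour_vocab)
print(colour_codes, decode(colour_codes, colour_vocab))  # [2, 0] ['blue', '#na#']

# PassengerId: every validation ID is new, so every code is 0.
train_ids = list(range(1, 711))
valid_ids = list(range(711, 892))
id_vocab = build_vocab(train_ids)
train_id_codes = encode(train_ids, id_vocab)
valid_id_codes = encode(valid_ids, id_vocab)
print(train_id_codes[:5], set(valid_id_codes))  # [1, 2, 3, 4, 5] {0}
```

In this sketch the training codes happen to equal the IDs themselves, because ID i lands in vocabulary slot i. That would explain why the training column looked untouched while the validation column was all zeros.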

Ah! The light bulb is starting to flicker in my brain. I appreciate the example. I think that makes sense now. Thanks much, Zach and Joe!

Cheers, Mark
