I'm seeing an issue with the vanilla Kaggle Titanic competition dataset. When I include Categorify in the procs and create a TabularPandas object, the categorical variable 'PassengerId' is set to 0 for every row of the validation set and is shown as #na# in the validation dataloader. When I run the same test without Categorify, the PassengerId values in the TabularPandas validation set are left as they are. Here is a code snippet of what I've done:
from fastai.tabular.all import *
!kaggle competitions download -c titanic
train_data = pd.read_csv(path/'train.csv')
train_data.head()
# PassengerId shows up normally; the dataframe has 891 rows
dep_var = 'Survived'
cat_vars = ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Cabin', 'Embarked']
cont_vars = ['Age', 'Fare']
procs = [Categorify, FillMissing, Normalize]
splits = IndexSplitter(list(range(710,891)))(range_of(train_data))
to = TabularPandas(train_data, procs, cat_vars, cont_vars, y_names=dep_var, splits=splits)
to.train.xs.PassengerId, to.valid.xs.PassengerId
# The training split has valid PassengerId values; the validation split is all 0s
(0 1
1 2
2 3
3 4
4 5
…
705 706
706 707
707 708
708 709
709 710
Name: PassengerId, Length: 710, dtype: int16,
710 0
711 0
712 0
713 0
714 0
…
886 0
887 0
888 0
889 0
890 0
Name: PassengerId, Length: 181, dtype: int16)
dls = to.dataloaders(bs=64)
dls.valid.show_batch()
# The dataloader created from the TabularPandas shows #na# for PassengerId in the validation set
|PassengerId|Pclass|Sex|…
|#na#|1|female|
|#na#|1|male|
|#na#|1|male|
I may be misunderstanding what to expect, or I may have made a rookie mistake. I haven't observed this with other tabular datasets, though. I'd appreciate any assistance the community can offer.
Thanks,
Mark
I just wondered why that was the only field that Categorify didn't seem to handle well. I'll take it out as you suggest.
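For anyone who lands here with the same question, here is my best guess at the cause, written as a plain-Python sketch rather than fastai's own code. It assumes Categorify builds each column's vocabulary from the training split only and maps any value it has not seen to index 0, which is displayed as #na#. The helper names (`build_vocab`, `encode`, `decode`) and the toy `Sex` values are hypothetical and exist only for illustration. Since every PassengerId is unique, none of the validation IDs (711 to 891) can appear in the training vocabulary, whereas a column like Sex repeats the same values in both splits.

```python
# Plain-Python sketch of a train-only category vocabulary.
# This is NOT fastai's implementation; helper names are hypothetical.

NA_TOKEN = "#na#"  # assumption: code 0 is reserved for missing/unseen values


def build_vocab(train_values):
    """Map each distinct training value to an integer code, starting at 1."""
    vocab = {}
    for value in sorted(set(train_values)):
        vocab[value] = len(vocab) + 1
    return vocab


def encode(values, vocab):
    """Look up each value; anything absent from the vocabulary becomes 0."""
    return [vocab.get(value, 0) for value in values]


def decode(codes, vocab):
    """Turn codes back into values; code 0 is shown as the #na# token."""
    inverse = {code: value for value, code in vocab.items()}
    return [inverse.get(code, NA_TOKEN) for code in codes]


# Same split as in the post: rows 0-709 for training, rows 710-890 for validation.
passenger_ids = list(range(1, 892))
train_ids, valid_ids = passenger_ids[:710], passenger_ids[710:]

id_vocab = build_vocab(train_ids)
valid_id_codes = encode(valid_ids, id_vocab)

# A low-cardinality column (toy values) shares its categories across splits.
sex_vocab = build_vocab(["male", "female", "female", "male"])
sex_valid_codes = encode(["female", "male"], sex_vocab)

print(set(valid_id_codes))                    # every validation ID is unseen
print(decode(valid_id_codes[:3], id_vocab))   # what a batch display would show
print(sex_valid_codes)                        # known categories keep real codes
```

If that assumption holds, PassengerId is the only column affected simply because it is the only one whose values never repeat, so it carries no information as a categorical feature and dropping it from cat_vars is the sensible fix.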