Experimenting with the Titanic Kaggle competition, I first stumbled on a bug and then on a question: what happens when the sets of categories in the train and test data sets differ?
In the Titanic dataset this is the case for the Parch column: the test ds has 8 classes, and the train ds only 7.
Consider:
for c in cat_vars:
    train_df[c] = train_df[c].astype('category').cat.as_ordered()
apply_cats(test_df, train_df)
np.unique(train_df['Parch'])
np.unique(test_df['Parch'])
array([0, 1, 2, 3, 4, 5, 6])
array([ 0., 1., 2., 3., 4., 5., 6., nan, nan])
So apply_cats just dropped the 8th class from the test set, as if it had never existed. The unknown/new class is thus treated the same as a missing value, yet the train dataset didn’t have a NaN class (in this particular case). So what does this mean for the prediction: how does the NN handle this case?
To clarify my question: what happens when, during prediction on the test set, a data point appears that is NaN, whereas the train set never had a NaN for the same variable?
And I’m also puzzled at how apply_cats turned ints into floats; I don’t see any conversions here:
def apply_cats(df, trn):
    for n,c in df.items():
        if (n in trn.columns) and (trn[n].dtype.name=='category'):
            print(trn[n].cat.categories) # checked it to be integers
            df[n] = pd.Categorical(c, categories=trn[n].cat.categories, ordered=True)
edit: it appears that this happens because of the NaN values, which cast the whole array from ‘int’ to ‘float’. proc_df() later rectifies that. Supposedly this behavior (the inability to have int NaNs) will get fixed in pandas2, whenever that happens; see here and here.
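To make the cast visible, here is a minimal sketch with made-up values (not the real Titanic columns; `train_cats` and `test_vals` are hypothetical names). It applies the same pd.Categorical call that apply_cats uses. The value that is not among the train categories is stored internally as code -1, and it is only when the categorical is converted back to a plain array that NaN appears and forces the ints to float:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: categories as seen in train, values as seen in test.
train_cats = pd.Index([0, 1, 2, 3, 4, 5, 6])
test_vals = pd.Series([0, 2, 9])  # 9 is not among the train categories

cat = pd.Categorical(test_vals, categories=train_cats, ordered=True)

# Internally the unknown value is just the integer code -1.
codes = cat.codes.tolist()
print(codes)

# Materializing the values needs NaN for the unknown entry, so the
# integer categories are promoted to float.
values = np.asarray(cat)
print(values.dtype, values)
```

If the later numericalization shifts the codes by one (which is my understanding of what proc_df() does, but treat that as an assumption), the unknown class ends up as index 0. With no NaN in the train set, whatever the model has stored for index 0 would never have been trained, so the prediction for such a row depends on an untrained embedding row rather than failing outright.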
And while we are at it, for others who might find it useful, I first did:
for c in cat_vars:
    train_df[c] = train_df[c].astype('category').cat.as_ordered()
    test_df[c] = test_df[c].astype('category').cat.as_ordered()
and ended up with a bunch of CUDA errors at the prediction stage, which took me a while to trace down to this anomaly (using apply_cats() fixed it): the test ds had a class that wasn’t in the train ds.
np.unique(train_df['Parch'])
np.unique(test_df['Parch'])
array([0, 1, 2, 3, 4, 5, 6])
array([0, 1, 2, 3, 4, 5, 6, 9])
So don’t skip apply_cats.
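A rough sketch of why the independent conversion can break prediction, again with made-up values. It assumes the model sizes each embedding as the number of train categories plus one, and that codes are shifted by one before lookup; both are assumptions about the pipeline, and `n_embeddings` is a hypothetical name:

```python
import pandas as pd

# Made-up values mirroring the Parch situation: 9 appears only in test.
train = pd.Series([0, 1, 2, 3, 4, 5, 6])
test = pd.Series([0, 1, 2, 3, 4, 5, 6, 9])

train_cat = train.astype('category').cat.as_ordered()

# Assumption: embedding rows = number of train categories + 1 (row 0 for NaN).
n_embeddings = len(train_cat.cat.categories) + 1

# Converting test independently gives it 8 categories of its own.
test_indep = test.astype('category').cat.as_ordered()
max_idx_indep = int(test_indep.cat.codes.max()) + 1

# Reusing the train categories (what apply_cats does) maps 9 to code -1.
test_shared = pd.Categorical(test, categories=train_cat.cat.categories, ordered=True)
max_idx_shared = int(test_shared.codes.max()) + 1

print(n_embeddings, max_idx_indep, max_idx_shared)
```

Under those assumptions the independent conversion produces an index equal to the number of embedding rows, one past the last valid row, which would be consistent with the CUDA errors I saw; with the shared categories every index stays in range.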