What happens when train and test categorical classes are different?

While experimenting with the Titanic Kaggle competition I first stumbled on a bug, and then on a question: what happens when the sets of categories in the train and test data sets are different?

In the Titanic dataset this is the case for the column Parch: the test ds has 8 classes, and the train ds only 7.

Consider:

for c in cat_vars: 
    train_df[c] = train_df[c].astype('category').cat.as_ordered()
apply_cats(test_df, train_df)

np.unique(train_df['Parch'])
np.unique(test_df['Parch'])

array([0, 1, 2, 3, 4, 5, 6])
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6., nan, nan])

So apply_cats just dropped the 8th class from the test set, as if it had never existed. The unknown/new class is now treated the same as a missing value, yet the train dataset has no missing (NaN) values for this column (in this particular case). So what does that mean for the prediction: how does the NN handle this case?

To clarify my question: what happens when, during prediction on the test set, a data point is NaN, whereas the train set never had a NaN for the same variable?
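For what it's worth, here is a minimal sketch of what pandas does with a value outside a fixed category list. The toy values are made up, and the `+ 1` shift at the end is an assumption about how the old fastai `proc_df`/numericalize step stores codes, so check it against the version you are running.

```python
import pandas as pd

# Toy stand-ins for the train and test 'Parch' columns (made-up values).
train = pd.Series([0, 1, 2, 3, 4, 5, 6]).astype('category').cat.as_ordered()
test_values = [0, 2, 9]  # 9 never appears in train

# Same call that apply_cats makes: reuse the train categories.
test = pd.Categorical(test_values, categories=train.cat.categories, ordered=True)

codes = list(test.codes)            # the unseen value 9 becomes NaN, i.e. code -1
n_missing = int(test.isna().sum())  # one missing entry

# Assumption: the library shifts codes by one, so missing maps to index 0.
shifted = [int(c) + 1 for c in codes]
print(codes, n_missing, shifted)
```

If that assumption holds, the network sees index 0 for the unseen class. When the train set had no missing values in that column, the embedding row for index 0 was never used in training, so it presumably still holds its initial weights.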

And I’m also puzzled at how apply_cats turned ints into floats; I don’t see any conversions here:

def apply_cats(df, trn):
    for n,c in df.items():
        if (n in trn.columns) and (trn[n].dtype.name=='category'):
            print(trn[n].cat.categories) # checked it to be integers
            df[n] = pd.Categorical(c, categories=trn[n].cat.categories, ordered=True)

Edit: it appears that this happens because of the NaN values, which cast the whole array from ‘int’ to ‘float’. proc_df() later rectifies that. Supposedly this behavior (the inability to have int NaNs) will get fixed in pandas2, whenever that happens; see here and here.
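A small sketch of the casting behaviour, using toy values. The nullable `Int64` extension dtype shown at the end exists in current pandas releases; whether it plays well with the rest of this pipeline is something I have not checked.

```python
import numpy as np
import pandas as pd

# An integer column that picks up a NaN is upcast to float64.
s = pd.Series([0, 1, np.nan])
print(s.dtype)  # float64

# Converting a Categorical with a missing entry back to an array
# also yields floats, which is what np.unique gets to see.
cat = pd.Categorical([0, 1, 9], categories=[0, 1, 2])
arr = np.asarray(cat)
print(arr.dtype)

# The nullable extension dtype keeps integers alongside a missing value.
nullable = pd.array([0, 1, None], dtype="Int64")
print(nullable.dtype, int(nullable.isna().sum()))
```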


And while we are at it, for others who might find it useful: I first did:

for c in cat_vars: 
    train_df[c] = train_df[c].astype('category').cat.as_ordered()
    test_df[c] = test_df[c].astype('category').cat.as_ordered()

and ended up with a bunch of CUDA errors at the prediction stage, which took me a while to trace down to this anomaly: I had a class in the test ds which wasn’t in the train ds. Using apply_cats() fixed it.

np.unique(train_df['Parch'])
np.unique(test_df['Parch'])

array([0, 1, 2, 3, 4, 5, 6])
array([0, 1, 2, 3, 4, 5, 6, 9])

So don’t skip apply_cats.
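My reading of the CUDA errors is an out-of-range embedding index, though that is an assumption, since I did not trace it into the library. The sketch below only shows the pandas side: when each frame builds its own categories, the test set produces a code that a model sized from train has no slot for.

```python
import pandas as pd

train_vals = [0, 1, 2, 3, 4, 5, 6]
test_vals = [0, 1, 2, 3, 4, 5, 6, 9]

# Each frame builds its own category list, as in the loop above.
train = pd.Series(train_vals).astype('category').cat.as_ordered()
test = pd.Series(test_vals).astype('category').cat.as_ordered()

n_train_classes = len(train.cat.categories)  # valid codes run from 0 to n - 1
max_test_code = int(test.cat.codes.max())    # code assigned to the value 9

# A lookup table sized from train has no row for this code.
out_of_range = max_test_code >= n_train_classes
print(n_train_classes, max_test_code, out_of_range)
```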


I had a similar question. Not sure if it was answered before, but I could not find an answer when I searched. If the codes for the categories in the training and test sets are different, will it impact the prediction scores? For example, if ‘Category Value 1’ has code 100 in the training set and code 101 in the test set, will it impact the results? This case can arise if the test set is not derived by splitting the dataset into train and test, but is a separate dataset given by the competition.
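To make the concern concrete, here is a toy sketch (made-up values) of the same category value getting different codes when each set is encoded independently, and of the codes lining up again once the train categories are reused.

```python
import pandas as pd

train_vals = ['a', 'c', 'd']
test_vals = ['a', 'b', 'c', 'd']

train = pd.Series(train_vals).astype('category')
test = pd.Series(test_vals).astype('category')

# Codes are positions in each frame's own sorted category list.
train_code = dict(zip(train_vals, train.cat.codes.tolist()))
test_code = dict(zip(test_vals, test.cat.codes.tolist()))
print(train_code['c'], test_code['c'])  # the same value, two different codes

# Reusing the train categories keeps the codes aligned; 'b' becomes missing (-1).
aligned = pd.Categorical(test_vals, categories=train.cat.categories)
aligned_code = dict(zip(test_vals, aligned.codes.tolist()))
print(aligned_code)
```

Since the model only ever sees the integer codes, a shifted code would be read as a different category.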

To answer the questions:

  1. You should definitely use the same mappings for training and test/validation. I believe the df can export the cat mapping, and then you can import it. I did this for another project, but I can’t seem to find my notebook anymore, so I can’t copy the source right now.

  2. I believe in one of the part 1 lectures (part 4? - the one with the housing prices) it was suggested that, to solve this problem, you create an extra category in training to handle unknown categories. You can then turn on training (the new category would basically be associated with a random weight, since no training data would have it) and then use that category whenever you do not have a correct mapping. Alternatively, if you know anything about the new category (i.e. it is basically the same as an existing category), you could potentially map it to an existing category.
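A rough sketch of the export/import idea from point 1, not the lost notebook: the column name and values are made up, and JSON is just one convenient way to store the lists.

```python
import json
import pandas as pd

train_df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
train_df['Embarked'] = train_df['Embarked'].astype('category')

# Export: one list of categories per categorical column (order matters).
mapping = {
    col: train_df[col].cat.categories.tolist()
    for col in train_df.columns
    if train_df[col].dtype.name == 'category'
}
saved = json.dumps(mapping)  # could be written to a file instead

# Import: rebuild the same categories on new data.
test_df = pd.DataFrame({'Embarked': ['Q', 'S', 'X']})
for col, cats in json.loads(saved).items():
    test_df[col] = pd.Categorical(test_df[col], categories=cats)

test_codes = test_df['Embarked'].cat.codes.tolist()
print(mapping, test_codes)  # 'X' is not in the mapping, so its code is -1
```

Note that this sketch does not store the ordered flag; if the columns were made ordered in training, that would need to be saved as well.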

It’s probably better to tweak proc_df a bit and add LabelEncoder to categorical columns.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

LabelEncoder is basically a dictionary. You can extract it and use it for future encoding:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  # the class is imported directly above, so no 'preprocessing.' prefix
le.fit(X)

le_dict = dict(zip(le.classes_, le.transform(le.classes_)))

Retrieve the label for a single new item; if the item is missing, set the value to unknown:

le_dict.get(new_item, '<Unknown>')

Retrieve labels for a Dataframe column:

df[your_col].apply(lambda x: le_dict.get(x, '<Unknown>'))
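One thing to watch with the snippets above: falling back to the string '<Unknown>' mixes strings with integer labels in the same column. A possible variant, sketched here with made-up data, shifts the learned labels up by one and reserves 0 for unseen values. The `UNKNOWN` name is hypothetical and not part of LabelEncoder.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_col = pd.Series(['cat', 'dog', 'hamster'])
new_col = pd.Series(['dog', 'lion', 'cat'])

le = LabelEncoder()
le.fit(train_col)

# Shift the learned labels up by one so that 0 is free for unseen values.
UNKNOWN = 0  # hypothetical sentinel, not part of LabelEncoder
le_dict = {cls: code + 1 for code, cls in enumerate(le.classes_)}

# Unseen items such as 'lion' fall back to the sentinel.
encoded = new_col.map(lambda x: le_dict.get(x, UNKNOWN)).tolist()
print(encoded)
```

The model would then need one extra slot for the sentinel, in line with the extra-category suggestion earlier in the thread.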

A solution suggested by ChatGPT:

from fastai.tabular.all import *
import pandas as pd
from pandas.api.types import CategoricalDtype

# List of animals including 'other'
animals = ['dog', 'cat', 'hamster', 'parrot', 'mouse', 'other']

# Create a categorical dtype for animals
animal_dtype = CategoricalDtype(categories=animals, ordered=False)

# Example data
train_data = pd.DataFrame({'animal': ['cat', 'dog', 'hamster', 'parrot', 'elephant']})
test_data = pd.DataFrame({'animal': ['cat', 'dog', 'hamster', 'parrot', 'mouse', 'lion']})

# Replace non-listed animals with 'other'
train_data['animal'] = train_data['animal'].where(train_data['animal'].isin(animals), 'other')
test_data['animal'] = test_data['animal'].where(test_data['animal'].isin(animals), 'other')

# Convert the 'animal' column to the new categorical dtype
train_data['animal'] = train_data['animal'].astype(animal_dtype)
test_data['animal'] = test_data['animal'].astype(animal_dtype)

# Check the transformed data
print(train_data)
print(test_data)