CategoryBlock with custom vocab: labels confused at inference

Hi all, apologies in advance if I’m holding it wrong, but I think I’ve found some really odd behavior in the TabularLearner.

I’m using it for single-label categorization. My xs contain categorical and continuous variables, and I’m using procs=[Categorify] to handle the category fields. When I train and run inference without a y_block, I get good results. However, for reasons, I want to specify my own vocab for the y_block, including labels not currently present in the training data. When I define a CategoryBlock whose vocab differs in any way from the vocab that gets inferred automatically when no y_block is passed, inference goes wacky. Training still appears to work fine, but at inference time all the category ids are mixed up. From my troubleshooting, this only happens when the categories that are unseen in the training data sort alphabetically before, or intermixed with, the inferred categories.

I’ve further isolated this: it isn’t about the new labels themselves, but about any mismatch between the vocab passed in via the CategoryBlock and the one inferred from the data (or about the library simply ignoring the passed vocab). Below is a minimal repro:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, size=(1000,1)), columns=['number'])
df['label'] = 'num-' + df['number'].astype(str)
df.loc[:700, 'is_valid'] = False
df.loc[700:, 'is_valid'] = True

df_test = pd.DataFrame(np.random.randint(1, 10, size=(100,1)), columns=['number'])
df_test['label'] = 'num-' + df_test['number'].astype(str)

At this point, df looks like this:

     number  label  is_valid
0         8  num-8     False
1         6  num-6     False
2         3  num-3     False
3         4  num-4     False
4         9  num-9     False
...     ...    ...       ...
995       9  num-9      True
996       8  num-8      True
997       6  num-6      True
998       1  num-1      True
999       9  num-9      True

1000 rows × 3 columns

Now a working training run, without specifying a y_block, looks like this:

from fastai.tabular.all import *

dls = TabularDataLoaders.from_df(
    df.drop(columns=['is_valid']),
    cat_names=['number'],
    y_names=["label"],
    procs=[Categorify],
    bs=8,
    device=torch.device('cuda'),
    num_workers=8,
)
learn = tabular_learner(
    dls,
    metrics=[
        accuracy,
        F1Score(average="weighted"),
        Precision(average="weighted"),
        Recall(average="weighted"),
    ],
)
learn.fit_one_cycle(5)
dl = learn.dls.test_dl(df_test)
learn.show_results(dl=dl)

and the output I’d expect:

epoch  train_loss  valid_loss  accuracy  f1_score  precision_score  recall_score  time
0      0.518991    0.303508    1.000000  1.000000  1.000000         1.000000      00:01
1      0.111306    0.012029    1.000000  1.000000  1.000000         1.000000      00:01
2      0.041993    0.004647    1.000000  1.000000  1.000000         1.000000      00:01
3      0.019425    0.002686    1.000000  1.000000  1.000000         1.000000      00:01
4      0.015001    0.002006    1.000000  1.000000  1.000000         1.000000      00:01

   number  label  label_pred
0       8      7           7
1       5      4           4
2       8      7           7
3       8      7           7
4       6      5           5
5       5      4           4
6       3      2           2
7       3      2           2

You can see the model perfectly predicts the label. dls.show_batch() returns:

number label
0 9
1 3
2 1
3 2
4 6
5 9
6 5
7 8
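
For reference, I believe the automatically inferred vocab is just the sorted unique labels in the training data. That’s easy to check with plain pandas, no fastai internals involved (the sorted-unique assumption is mine, not documented behavior):

inferred_vocab = sorted(df['label'].unique())
print(inferred_vocab)
# ['num-1', 'num-2', 'num-3', 'num-4', 'num-5', 'num-6', 'num-7', 'num-8', 'num-9']
print({label: i for i, label in enumerate(inferred_vocab)})
# {'num-1': 0, 'num-2': 1, ..., 'num-8': 7, 'num-9': 8}  -- consistent with num-8 showing up as 7 in show_results above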

Now, if I do everything the same but pass in a CategoryBlock with an extra category ('num-0') at the start of the vocab, like so:

dls = TabularDataLoaders.from_df(
    df.drop(columns=['is_valid']),
    cat_names=['number'],
    y_names=["label"],
    y_block=CategoryBlock(vocab=['num-0', 'num-1', 'num-2', 'num-3', 'num-4', 'num-5', 'num-6', 'num-7', 'num-8', 'num-9']),
    procs=[Categorify],
    bs=8,
    device=torch.device('cuda'),
    num_workers=8,
)

I get good training metrics, but the labels are all off by one:

epoch  train_loss  valid_loss  accuracy  f1_score  precision_score  recall_score  time
0      0.770442    0.339107    1.000000  1.000000  1.000000         1.000000      00:01
1      0.154223    0.013723    1.000000  1.000000  1.000000         1.000000      00:01
2      0.047066    0.005719    1.000000  1.000000  1.000000         1.000000      00:01
3      0.022238    0.003779    1.000000  1.000000  1.000000         1.000000      00:01
4      0.020751    0.003900    1.000000  1.000000  1.000000         1.000000      00:01

   number  label  label_pred
0       1      1           0
1       3      3           2
2       8      8           7
3       2      2           1
4       3      3           2
5       6      6           5
6       5      5           4
7       4      4           3

and dls.show_batch() confirms the labeling skew:

number label
0 8
1 8
2 7
3 7
4 9
5 9
6 2
7 7
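
For completeness, this is how I looked at the raw predicted indices without any of the library’s decoding, just get_preds plus argmax, so the skew can be compared against either vocab by hand (the variable names are mine; nothing here assumes which mapping fastai is actually using):

preds, _ = learn.get_preds(dl=dl)          # raw probabilities for the test dl
pred_idx = preds.argmax(dim=1).tolist()    # predicted class indices, no decoding

for idx, true_label in list(zip(pred_idx, df_test['label']))[:8]:
    print(idx, true_label)                 # compare the raw index against the true string label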

My expectation was that the vocab I pass into the y_block would be used for the mapping in both directions. Instead, it looks like a new mapping is built internally from the labels discovered in the data, and those indexes are then looked up against the y_block vocab, where different values sit at those indexes.
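
Here’s a tiny pure-Python sketch of the mismatch I think is happening; the "inferred" mapping below is my guess at what gets built internally from the data, not something I’ve traced through the library yet:

inferred_vocab = sorted(df['label'].unique())          # ['num-1', ..., 'num-9'] discovered from the data
passed_vocab = ['num-0', 'num-1', 'num-2', 'num-3', 'num-4',
                'num-5', 'num-6', 'num-7', 'num-8', 'num-9']

inferred_o2i = {label: i for i, label in enumerate(inferred_vocab)}

idx = inferred_o2i['num-8']    # 7 -- the index the model appears to train against
print(inferred_vocab[idx])     # 'num-8' -- decoding with the inferred vocab is correct
print(passed_vocab[idx])       # 'num-7' -- decoding with the vocab I passed in is off by one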

Is this a bad expectation? Have I left off some argument somewhere? Is this a bug? I’d really like to be able to specify label mappings and utilize the intrinsic category handling provided by the library.

I started to trace this through the GitHub repo, but given the way the code is organized I think it’ll take me a while, so I’m asking here first.