Data Block labeling looks only at training set

I recently noticed that splitting my data differently in the splitting step changed the classes my databunch objects reported. For example, with split_by_rand_pct and different percentages and seeds, a couple of classes would randomly appear or disappear. I eventually realized this is because the classes are derived from the training split, and my data is sparse enough that some classes occasionally never end up in the training set at all.
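
To make the failure mode concrete, here is a minimal standalone sketch (plain Python, independent of fastai) of how a rare class can fall entirely into the validation portion of a random split depending on the seed; the label list, split fraction, and seeds are made up for illustration:

```python
import random

def split_classes(labels, valid_pct=0.2, seed=0):
    """Randomly split a list of labels and return the classes seen in each part."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_valid = int(len(idx) * valid_pct)
    valid_idx, train_idx = idx[:n_valid], idx[n_valid:]
    return {labels[i] for i in train_idx}, {labels[i] for i in valid_idx}

# 99 examples of a common class plus a single example of a rare class
labels = ['common'] * 99 + ['rare']

for seed in range(10):
    train_classes, valid_classes = split_classes(labels, valid_pct=0.2, seed=seed)
    missing = valid_classes - train_classes
    print(f"seed={seed}: train classes={sorted(train_classes)}, "
          f"missing from train={sorted(missing) or 'none'}")
```

Depending on the seed, the single 'rare' example sometimes lands entirely in the validation part, so the training split (and hence anything that derives its class list from it) sees one fewer class.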

Obviously you wouldn’t want your validation set to contain classes that the training set doesn’t have. But could the library alert the user to an issue like this? These details are hard for the user to anticipate: it’s easy to train models on data split differently (and therefore with different numbers of classes), and only later discover that the models are incompatible when trying to load one, because the architectures technically differ in that the final layer of the custom head has a different number of outputs.
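
In the meantime, one possible check on the user side, sketched below under the assumption of the v1 data block API (ImageList.from_folder, split_by_rand_pct, label_from_folder) and a folder-per-class layout (the path is hypothetical): derive the full set of class names straight from the directory structure and compare it against data.classes before training, so a silently dropped sparse class at least produces a warning.

```python
from pathlib import Path
from fastai.vision import ImageList

path = Path('data/my_images')  # hypothetical dataset root, one sub-folder per class

data = (ImageList.from_folder(path)
        .split_by_rand_pct(valid_pct=0.2, seed=42)
        .label_from_folder()
        .databunch())

# Classes present anywhere in the dataset, taken directly from the folder names
all_classes = {p.name for p in path.iterdir() if p.is_dir()}

# data.classes reflects only what the labelling step saw (i.e. the training split)
missing = all_classes - set(data.classes)
if missing:
    print(f"Warning: {len(missing)} class(es) never made it into the training split: {sorted(missing)}")
```

Something along these lines could conceivably live in the library itself, emitting a warning whenever the split produces a validation label that the training-derived class list doesn’t contain.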
