How should `ImageMultiDataset` handle labels in `valid_ds` unseen by `train_ds`?

yang-zhang · October 24, 2018, 3:32pm

When `ImageMultiDataset.__init__` is passed `labels` and a non-null `classes`, there is no guarantee that every item in `labels` is already in `classes`.
A typical example is `train_ds.classes` being passed as this non-null `classes`. (link to code)

When an item in `labels` is not already in `classes`, a KeyError is raised by this line of code: `self.y = [np.array([self.class2idx[o] for o in l], dtype=np.int64) for l in labels]` (link to code).
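To make the failure concrete, here is a minimal sketch that reproduces the KeyError outside of fastai. The toy labels and the `encode` helper are made up for illustration; only the list comprehension mirrors the line quoted above.

```python
import numpy as np

# Hypothetical toy labels, not taken from any real dataset.
train_labels = [["red", "dress"], ["blue", "shirt"]]
valid_labels = [["red", "shirt"], ["green", "dress"]]  # "green" never appears in training

# Build classes and class2idx from the training labels only.
classes = sorted({o for l in train_labels for o in l})
class2idx = {c: i for i, c in enumerate(classes)}

def encode(labels, class2idx):
    # Same lookup pattern as the quoted line: a plain dict lookup per label.
    return [np.array([class2idx[o] for o in l], dtype=np.int64) for l in labels]

train_y = encode(train_labels, class2idx)  # works: every label is known

missing = None
try:
    valid_y = encode(valid_labels, class2idx)
except KeyError as e:
    # The dict lookup fails on the first label that is not in classes.
    missing = e.args[0]
    print(f"KeyError for unseen label: {missing!r}")
```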

This situation can occur for multilabel datasets with a large number of unbalanced labels and/or with a heavy tail. A real example is deepfashion dataset’s attributes data.

When passing a multilabel df to ImageDataBunch.from_df, ImageMultiDataset.from_folder will random_split the examples, and some infrequent labels may end up only in the validation set (and, of course, some may end up only in the training set).
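A rough way to check a given split for this, sketched in plain Python. The `split_only_labels` helper and its shuffle-based split are stand-ins I made up for illustration; they are not the library's random_split.

```python
import random

def split_only_labels(labels, valid_pct=0.2, seed=0):
    """Randomly split items and report labels that land on only one side.

    Hypothetical helper: `labels` is a list of label lists, one per item.
    Returns (labels only in validation, labels only in training).
    """
    rng = random.Random(seed)
    idxs = list(range(len(labels)))
    rng.shuffle(idxs)
    cut = int(valid_pct * len(labels))
    valid_idx, train_idx = idxs[:cut], idxs[cut:]
    train_set = {o for i in train_idx for o in labels[i]}
    valid_set = {o for i in valid_idx for o in labels[i]}
    return valid_set - train_set, train_set - valid_set

# Heavy-tail toy data: one common label plus a label that occurs exactly once.
labels = [["common", f"rare_{i}"] for i in range(10)]
valid_only, train_only = split_only_labels(labels, valid_pct=0.2)
print("only in validation:", sorted(valid_only))
print("only in training:", sorted(train_only))
```

With labels that occur only once, every item sent to the validation set contributes a label the training set has never seen, which is the case that triggers the KeyError.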

Typically we use a smaller sample of the data to develop the model before running it on all the data. The sampling can make this situation more likely to occur.

This thread is to discuss the handling of this situation. Some previous discussions were in this closed PR.

Notebook showing the situation in this gist.

jeremy · October 24, 2018, 6:35pm

Many thanks for this useful and important discussion - and the notebook!

I don’t think we should remove labels from the val set, since then we’re reporting an incorrect error/loss. i.e. imagine if 80% of the labels in the val set are new - then we’re ignoring them all in our calculation!

We could add a parameter that maps them to some existing level, and raises a warning if that happens. How does that sound? You could even add an “unknown” class when training.
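Roughly what that could look like, as a sketch only. `encode_with_unknown` and its `unknown` parameter are hypothetical names for illustration, not an existing fastai API, and it assumes the "unknown" class was already added to `classes` when training.

```python
import warnings
import numpy as np

def encode_with_unknown(labels, classes, unknown="unknown"):
    """Encode label lists, mapping labels missing from `classes` to `unknown`.

    Hypothetical helper: `unknown` must already be one of `classes`.
    """
    if unknown not in classes:
        raise ValueError(f"{unknown!r} must be one of the classes")
    class2idx = {c: i for i, c in enumerate(classes)}
    unseen = sorted({o for l in labels for o in l} - set(classes))
    if unseen:
        warnings.warn(f"{len(unseen)} label(s) not in classes mapped to {unknown!r}: {unseen}")
    unk = class2idx[unknown]
    encoded = []
    for l in labels:
        # dict.fromkeys keeps order and drops duplicates, since several
        # unseen labels on one item all collapse to the same index.
        idxs = list(dict.fromkeys(class2idx.get(o, unk) for o in l))
        encoded.append(np.array(idxs, dtype=np.int64))
    return encoded

classes = ["blue", "dress", "unknown"]
valid_y = encode_with_unknown([["green", "dress"], ["purple", "teal"]], classes)
print([y.tolist() for y in valid_y])
```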

yang-zhang · October 24, 2018, 7:44pm

Jeremy, thanks for the suggestions.

We could add a parameter that maps them to some existing level, and raises a warning if that happens.

This should be doable by adding an optional parameter. I don’t think we can map an unknown label to an arbitrary existing level in the train set; we should instead require the user to specify which level(s) to map to. I find it difficult to imagine a scenario where I would know how to specify this mapping (unless one preemptively adds an “unknown” level to the training set, which feels clunky).

You could even add an “unknown” class when training.

This option seems more natural to me. But I have a hard time imagining doing this without updating train_ds.classes, train_ds.class2idx, and train_ds.y.

I think it is best to make no changes to the code. Instead, when creating train_ds, if you think this “val-labels-not-in-train” situation might happen, pass the union of the training and validation labels as classes. I’ve updated the gist to show this usage pattern.
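The idea, sketched in plain Python rather than with the actual dataset classes (the toy labels and the `encode` helper are made up for illustration; the gist has the real usage):

```python
import numpy as np

# Hypothetical toy labels; "green" occurs only in the validation set.
train_labels = [["red", "dress"], ["blue", "shirt"]]
valid_labels = [["red", "shirt"], ["green", "dress"]]

# Build classes from the union of training and validation labels.
train_set = {o for l in train_labels for o in l}
valid_set = {o for l in valid_labels for o in l}
classes = sorted(train_set | valid_set)
class2idx = {c: i for i, c in enumerate(classes)}

def encode(labels, class2idx):
    # Plain dict lookup, as in the line that raised the KeyError.
    return [np.array([class2idx[o] for o in l], dtype=np.int64) for l in labels]

# Both sets now encode against the same classes without a KeyError.
train_y = encode(train_labels, class2idx)
valid_y = encode(valid_labels, class2idx)
print(classes)
print([y.tolist() for y in valid_y])
```

One thing to keep in mind with this pattern: the model gets an output for "green" but never sees a positive training example for it.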

jeremy · October 25, 2018, 3:18am

Yes that seems like a good suggestion.