When `ImageMultiDataset.__init__` is passed `labels` and a non-null `classes`, there is no guarantee that each item in `labels` is already in `classes`. A typical example is `train_ds.classes` being passed as this non-null `classes`. (link to code)
When an item in `labels` is not already in `classes`, a `KeyError` will be raised by this line of code: `self.y = [np.array([self.class2idx[o] for o in l], dtype=np.int64) for l in labels]` (link to code).
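A minimal standalone sketch of how the error arises (the names below mimic the fastai code but are not the actual fastai classes):

```python
import numpy as np

classes = ["cat", "dog"]                    # e.g. train_ds.classes
class2idx = {c: i for i, c in enumerate(classes)}
labels = [["cat"], ["dog", "zebra"]]        # "zebra" never appeared in training

missing = None
try:
    y = [np.array([class2idx[o] for o in l], dtype=np.int64) for l in labels]
except KeyError as e:
    missing = e.args[0]                     # the label that has no index
    print("unseen label:", missing)         # → unseen label: zebra
```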
This situation can occur for multilabel datasets with a large number of unbalanced labels and/or a heavy tail. A real example is the DeepFashion dataset’s attributes data.
When a multilabel df is passed to `random_split`, some infrequent labels may end up only in the validation set (and, of course, some may end up only in the training set).
Typically we use a smaller sample of the data to develop the model before running it on all the data. The sampling can make this situation more likely to occur.
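One way to check for this condition before training (a hypothetical helper, not part of fastai):

```python
def labels_only_in(split_a, split_b):
    """Labels that occur in split_a but never in split_b."""
    a = {o for row in split_a for o in row}
    b = {o for row in split_b for o in row}
    return a - b

# A tiny heavy-tailed example: the rare label lands only in the valid split.
train = [["shirt"], ["dress"], ["shirt", "dress"]]
valid = [["shirt", "paisley-trim"]]
print(labels_only_in(valid, train))   # → {'paisley-trim'}
```

If the returned set is non-empty, building `valid_ds` with `train_ds.classes` will fail.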
This thread is to discuss the handling of this situation. Some previous discussions were in this closed PR.
Notebook showing the situation in this gist.
Many thanks for this useful and important discussion - and the notebook!
I don’t think we should remove labels from the val set, since then we’re reporting an incorrect error/loss. i.e. imagine if 80% of the labels in the val set are new - then we’re ignoring them all in our calculation!
We could add a parameter that maps them to some existing level, and raises a warning if that happens. How does that sound? You could even add an “unknown” class when training.
Jeremy, thanks for the suggestions.
> We could add a parameter that maps them to some existing level, and raises a warning if that happens.
This should be doable by adding an optional parameter. I don’t think we can map an unknown label to an arbitrary existing level in the train set; instead we should require the user to specify which level(s) to map to. I myself find it difficult to imagine a scenario where I would know how to specify this mapping (unless one preemptively adds an “unknown” level to the training set, which feels clunky).
> You could even add an “unknown” class when training.
This option seems more natural to me. But I have a hard time imagining doing this without updating
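For illustration, here is a rough sketch of the “unknown level plus warning” idea. This is hypothetical code, not the fastai API; the point is just that with an extra level, unseen labels can be mapped instead of raising `KeyError`:

```python
import warnings
import numpy as np

classes = ["cat", "dog", "unknown"]   # preemptively include an "unknown" level
class2idx = {c: i for i, c in enumerate(classes)}
unk = class2idx["unknown"]

def encode(l):
    # Map labels never seen at training time to "unknown", with a warning.
    unseen = [o for o in l if o not in class2idx]
    if unseen:
        warnings.warn(f"mapping unseen labels {unseen} to 'unknown'")
    return np.array([class2idx.get(o, unk) for o in l], dtype=np.int64)

labels = [["cat"], ["dog", "zebra"]]
y = [encode(l) for l in labels]       # no KeyError; "zebra" becomes index 2
```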
I think it is best to make no changes to the code. Instead, when creating `train_ds`, if you think this “val-labels-not-in-train” situation might happen, pass the union of the training and validation labels as `classes`. I’ve updated the gist to show this usage pattern.
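The union pattern is a one-liner; a sketch with made-up label data:

```python
train_labels = [["cat"], ["dog"]]
valid_labels = [["dog", "zebra"]]

# Build `classes` from the union of both splits, so every label in either
# split is guaranteed an index and no KeyError can occur.
classes = sorted({o for row in train_labels + valid_labels for o in row})
print(classes)   # → ['cat', 'dog', 'zebra']
```

Both `train_ds` and `valid_ds` can then be constructed with this shared `classes`.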
Yes that seems like a good suggestion.