Multilabel classification with missing labels


I’m trying to implement ULMFiT for multilabel classification, but most of my data has missing labels. Each sample can be assigned to one or more classes (92 in total). In principle it seems straightforward, but most samples weren’t labelled for every class, so most of the label entries are missing (NaN).


   Sample      Class1    Class2    Class3    Class4
     A          NaN       NaN       NaN        1
     B           0        NaN       NaN       NaN
     C           1        NaN        0        NaN

Here, 1 indicates a sample is positive for a class and 0 indicates it’s negative. A missing value (NaN) gives no information about the relationship between a sample and a class. The labels are missing for a variety of reasons, and unfortunately it’s not possible to fill in these gaps.

My initial training strategy was to fill the missing values with -1 and modify the BCE loss function to ignore those entries, since missing information shouldn’t contribute to solving the problem. That seemed to work, but when I looked at one batch (data.one_batch()), it showed targets where all labels were missing (-1). Surprisingly, when I looked at all the training labels using data.train_ds.y.items, the labels were OK!
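
For reference, the kind of masked loss I mean can be sketched roughly as below. This is a minimal sketch in plain PyTorch, not my exact code: it assumes the model outputs raw logits (no sigmoid applied), that -1 is the sentinel for a missing label, and the names masked_bce_with_logits and fully_missing_rows are just illustrative helpers I made up for this post.

```python
import torch
import torch.nn.functional as F

MISSING = -1  # assumption: sentinel value used to fill the NaN labels


def masked_bce_with_logits(logits, targets, missing=MISSING):
    """BCE averaged over labelled entries only; entries equal to `missing` are ignored."""
    targets = targets.float()
    mask = (targets != missing).float()
    # Replace the sentinel with 0 so every target is a valid BCE target;
    # the mask zeroes those terms out afterwards.
    safe_targets = targets * mask
    per_entry = F.binary_cross_entropy_with_logits(
        logits, safe_targets, reduction="none"
    )
    # clamp avoids division by zero if a batch has no labelled entries at all
    return (per_entry * mask).sum() / mask.sum().clamp(min=1.0)


def fully_missing_rows(targets, missing=MISSING):
    """Number of samples in the batch whose labels are all missing."""
    return int((targets == missing).all(dim=1).sum())


# Toy batch mirroring samples A, B, C from the table (4 classes instead of 92).
targets = torch.tensor([[-1., -1., -1.,  1.],
                        [ 0., -1., -1., -1.],
                        [ 1., -1.,  0., -1.]])
logits = torch.zeros_like(targets)
loss = masked_bce_with_logits(logits, targets)
```

Running something like fully_missing_rows on the targets returned by data.one_batch() and comparing it with the same count over the full label array might help narrow down whether the labels are being changed somewhere between the dataset and the batch.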

Has anybody had similar problems?