How to train Multilabel classification with negative observations

I’m working with an academic dataset that has negatively labeled observations in the training, validation and test sets. A negative label means the absence of any class, i.e. a label of [0,0,0] for a problem with three classes. I’m not sure how to add these observations to the training and validation sets using the data_block API.

I’ve tried setting the label of the negative observations to an empty string, but that adds the empty string to the classes. That is not what I’m looking for, as I don’t want to predict a separate class for the absence of any class.

For example, in lesson3-planets.ipynb, if I replace the cloudy tag with an empty string:

df.loc[df.tags == 'cloudy', 'tags'] = ''
tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)
np.random.seed(42)

src = (ImageList.from_df(df, path, folder='train-jpg', suffix='.jpg')
       .split_by_rand_pct(0.2)
       .label_from_df(label_delim=' '))
data = (src.transform(tfms, size=128)
        .databunch().normalize(imagenet_stats))
arch = models.resnet18

acc_02 = partial(accuracy_thresh, thresh=0.2)
f_score = partial(fbeta, thresh=0.2)
learn = cnn_learner(data, arch, metrics=[acc_02, f_score])

print(learn.data.classes)

['',
 'agriculture',
 'artisinal_mine',
 'bare_ground',
 'blooming',
 'blow_down',
 'clear',
 'conventional_mine',
 'cultivation',
 'habitation',
 'haze',
 'partly_cloudy',
 'primary',
 'road',
 'selective_logging',
 'slash_burn',
 'water']

I’ve also tried setting the label to None, which results in this error:

~/code/plaquebox-classifier/venv/lib/python3.6/site-packages/fastai/data_block.py in label_from_df(self, cols, label_cls, **kwargs)
    281         labels = self.inner_df.iloc[:,df_names_to_idx(cols, self.inner_df)]
    282         # import pdb; pdb.set_trace();
--> 283         assert labels.isna().sum().sum() == 0, f"You have NaN values in column(s) {cols} of your dataframe, please fix it."
    284         if is_listy(cols) and len(cols) > 1 and (label_cls is None or label_cls == MultiCategoryList):
    285             new_kwargs,label_cls = dict(one_hot=True, classes= cols),MultiCategoryList

AssertionError: You have NaN values in column(s) 1 of your dataframe, please fix it.

How about setting the empty ones to some value such as 'no_label'?

That would effectively create another class, like [0, 0, 0, 1], whereas I’m trying to create [0, 0, 0].
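
To make the difference concrete, here is a minimal sketch in plain Python. The three class names and the encode helper are made up for illustration and are not part of fastai:

```python
def encode(tags, classes):
    """Return a 0/1 list with one slot per entry in `classes`."""
    present = set(tags.split())
    return [1 if c in present else 0 for c in classes]

# Hypothetical three-class problem.
classes = ['a', 'b', 'c']

# Option 1: a placeholder class for negatives adds a fourth output slot.
with_placeholder = encode('no_label', classes + ['no_label'])

# Option 2: keep the class list unchanged; a negative simply has no tags.
all_zero = encode('', classes)

print(with_placeholder)  # [0, 0, 0, 1]
print(all_zero)          # [0, 0, 0]
```

With the placeholder, the model would have to learn a fourth output for "nothing present"; with the all-zero target, it only has to keep the three real outputs low.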

I think I figured it out. Continuing with the lesson3-planets.ipynb example:

First, one-hot encode the tags, excluding the negative class. In this example I’m treating cloudy as the negative label, so I’ve left it out of the class list.

classes = [  
             'agriculture',
             'artisinal_mine',
             'bare_ground',
             'blooming',
             'blow_down',
             'clear',
             'conventional_mine',
             'cultivation',
             'habitation',
             'haze',
             'partly_cloudy',
             'primary',
             'road',
             'selective_logging',
             'slash_burn',
             'water']


def one_hot_encode(s, classes):
    # s is a space-delimited tag string; return one 0/1 entry per class
    tags = set(s.split(' '))
    return [1 if c in tags else 0 for c in classes]


labels = df['tags'].map(lambda s: one_hot_encode(s, classes=classes))
labels = torch.FloatTensor(list(labels.values))

# check one cloudy observation is all zeros
assert torch.equal(labels[df.loc[df.tags == 'cloudy'].index[0]], torch.zeros(len(classes)))

Then I pass the labels tensor to the _label_from_list method with one_hot=True.


tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)
src = (ImageList.from_df(df, path, folder='train-jpg', suffix='.jpg')
       .split_by_rand_pct(0.2)._label_from_list(labels, one_hot=True, classes=classes))

assert src.y.classes == classes

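# build the DataBunch the same way as in the earlier snippet
data2 = (src.transform(tfms, size=128)
         .databunch().normalize(imagenet_stats))

arch = models.resnet18
acc_02 = partial(accuracy_thresh, thresh=0.2)
f_score = partial(fbeta, thresh=0.2)
learn = cnn_learner(data2, arch, metrics=[acc_02, f_score])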
assert learn.data.classes == classes
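
With all-zero targets there is no output unit for the negative case, so a negative prediction would presumably be read off as "no class clears the threshold". A minimal sketch of that idea, assuming the model emits one raw logit per class and a sigmoid is applied before thresholding at 0.2 as in the metrics above. The decode helper, the shortened class list and the logits are hypothetical:

```python
import math

def decode(logits, classes, thresh=0.2):
    """Return the classes whose sigmoid probability exceeds `thresh`."""
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    return [c for c, p in zip(classes, probs) if p > thresh]

# Shortened class list, for illustration only.
classes = ['agriculture', 'clear', 'water']

# Made-up logits: one image with two likely tags, one with none.
positive = decode([2.0, -3.0, 0.5], classes)
negative = decode([-4.0, -3.0, -2.5], classes)

print(positive)  # ['agriculture', 'water']
print(negative)  # []
```

An empty list here corresponds to the [0, 0, 0] style target used for the cloudy observations during training.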