Using Data block API when output is a distribution of n classes

I am trying to figure out how to use the Data Block API for cases where the output is a distribution of ratings predicted as a histogram, in order to implement the following paper - https://arxiv.org/pdf/1709.05424.pdf

Each image has 10 possible scores and each score gets a certain number of votes. Here is an example of the data -

image_name, votes
57284, 4 6 27 54 70 49 23 10 3 3 
57225, 3 10 40 62 67 38 19 10 6 3 
57217, 1 2 13 29 78 68 36 18 4 4 
57252, 5 22 34 58 63 25 9 6 2 1 
57211, 3 12 35 57 81 41 16 6 5 1 
57222, 1 11 35 61 61 47 21 8 0 2 
56815, 13 20 52 83 47 19 7 5 1 2 
56828, 0 6 17 21 56 67 45 29 5 6 
57122, 0 7 11 23 54 70 41 32 10 7

The total number of labels is 10, since there are 10 possible scores.
My initial approach was to use label_from_df, like so -
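Since the paper ultimately predicts a normalized score distribution, the votes column has to be parsed into one at some point. A minimal sketch of that parsing step, independent of fastai (the inline csv_text sample and the votes_to_dist helper are hypothetical names, not library API):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical inline sample mirroring labels_sample.csv
csv_text = """image_name, votes
57284, 4 6 27 54 70 49 23 10 3 3
57225, 3 10 40 62 67 38 19 10 6 3
"""

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

def votes_to_dist(votes: str) -> np.ndarray:
    """Turn a space-separated votes string into a probability distribution."""
    counts = np.array(votes.split(), dtype=float)
    return counts / counts.sum()

dist = votes_to_dist(df.loc[0, "votes"])
print(dist.shape)  # (10,)
```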

data = (ImageList.from_csv('./', 'labels_sample.csv', folder='data', suffix='.jpg')
        .split_by_rand_pct()
        .label_from_df(label_delim=' ')
        .transform(tfms, size=224)
        .databunch(bs=8))

This doesn’t work: label_from_df with label_delim=' ' treats each space-separated vote count as a separate class label, so the labels returned are the individual vote counts rather than a distribution.

The other approach I tried is label_from_func, like so -

data = (ImageList.from_csv('./', 'labels_sample.csv', folder='data', suffix='.jpg')
        .split_by_rand_pct()
        .label_from_func(func)
        .transform(tfms, size=224)
        .databunch(bs=8))

Here func returns the image_name. This doesn’t work either, since the labels in the validation set do not show up in the train set, leaving the validation data empty.

I overrode the filter_missing_y like so -

class NimaLabelList(CategoryList):
    def __init__(self, items:Iterator, classes:Collection=None, label_delim:str=None, **kwargs):
        super().__init__(items, classes=classes, **kwargs)
        self.filter_missing_y = False  # keep validation items whose labels never appear in training

data = (ImageList.from_csv('./', 'labels_sample.csv', folder='data', suffix='.jpg')
        .split_by_rand_pct()
        .label_from_func(func, label_cls=NimaLabelList)
        .transform(tfms, size=224)
        .databunch(bs=8))

In this case, I don’t get empty validation data, but the validation labels are all 0 - which is consistent, since these labels are not part of the training data.

Finally, I tried this -

data = (ImageList.from_csv('./', 'labels_sample.csv', folder='data', suffix='.jpg')
        .split_by_rand_pct()
        .label_const(const=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
        .transform(tfms, size=224)
        .databunch(bs=8))

This way I have all the training and validation data.

With the last approach, I need to figure out a way to look up the index of each item to get its score distribution, which is required to compute the loss function when I use the dataloader.

x,y = next(iter(data.train_dl))
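One way to do that lookup is to preload the CSV into a dict keyed by image name, so the score distribution can be fetched in O(1) at loss time. A standard-library sketch (csv_text and scores_map are hypothetical names I chose for illustration):

```python
import csv
import io

# Hypothetical inline sample mirroring labels_sample.csv
csv_text = """image_name, votes
57284, 4 6 27 54 70 49 23 10 3 3
57225, 3 10 40 62 67 38 19 10 6 3
"""

# Map each image id to its raw votes string for constant-time lookup later.
reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
scores_map = {row["image_name"]: row["votes"] for row in reader}

print(scores_map["57284"])  # 4 6 27 54 70 49 23 10 3 3
```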

Another approach is to use ItemList instead of ImageList and see if I can work around the problem. But before going down that route, I would like to understand if there are other approaches or if I am missing something obvious.

Thanks in advance.

Here is how I ended up using the Data block API, in case someone stumbles across a similar use case.

class NimaLabelList(CategoryList):
    _processor = None  # skip fastai's category processor, which would remap labels
    def __init__(self, items:Iterator, classes=labels, label_delim:str=None, **kwargs):
        # `labels` is a module-level list of all possible image ids
        super().__init__(items, classes=classes, **kwargs)

    def get(self, i):
        # `scores_map` is a module-level dict of image id -> raw votes string
        dist = scores_map[self.items[i]]
        dist = np.array(dist.split(), dtype=float)  # split() also handles trailing spaces
        return dist

data = (ImageList.from_csv('./', 'labels_sample.csv', folder='data', suffix='.jpg')
        .split_by_rand_pct()
        .label_from_func(func, label_cls=NimaLabelList)
        .transform(tfms, size=224)
        .databunch(bs=8))
data.c = 10
arch  = models.mobilenet_v2
learner = cnn_learner(data, arch, pretrained=True)

I created a custom label class called NimaLabelList that inherits from CategoryList. A few things to note here -

  • I set _processor=None so that fastai does not run its category processor on the labels
  • classes = labels, where labels is the list of all possible image ids
  • get returns the score distribution for each image

Lastly, I set data.c = 10 so that the model outputs 10 values, corresponding to the 10 classes.
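For completeness, the paper trains this setup with a squared earth mover's distance (EMD) loss between the predicted and ground-truth score distributions. A PyTorch sketch of that loss (emd_loss is my name, not fastai API; it normalizes the target itself in case get returns raw vote counts):

```python
import torch
import torch.nn.functional as F

def emd_loss(output, target, r=2):
    """Squared EMD loss from the NIMA paper (a sketch).
    `output` is the model's 10 raw scores per image; `target` is the vote
    counts, normalized here into a probability distribution."""
    pred = F.softmax(output, dim=-1)                      # raw scores -> distribution
    target = target / target.sum(dim=-1, keepdim=True)    # counts -> distribution
    cdf_diff = torch.cumsum(pred, dim=-1) - torch.cumsum(target, dim=-1)
    per_item = (cdf_diff.abs() ** r).mean(dim=-1).pow(1.0 / r)
    return per_item.mean()                                # average over the batch
```

This could then be passed to the learner, e.g. cnn_learner(data, arch, pretrained=True, loss_func=emd_loss).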