I am trying to figure out how to use the Data Block API for cases where the output is a distribution of ratings predicted as a histogram. I am trying to implement the following paper - https://arxiv.org/pdf/1709.05424.pdf
Each image has 10 possible scores and each score gets a certain number of votes. Here is an example of the data -
image_name, votes
57284, 4 6 27 54 70 49 23 10 3 3
57225, 3 10 40 62 67 38 19 10 6 3
57217, 1 2 13 29 78 68 36 18 4 4
57252, 5 22 34 58 63 25 9 6 2 1
57211, 3 12 35 57 81 41 16 6 5 1
57222, 1 11 35 61 61 47 21 8 0 2
56815, 13 20 52 83 47 19 7 5 1 2
56828, 0 6 17 21 56 67 45 29 5 6
57122, 0 7 11 23 54 70 41 32 10 7
The total number of labels is 10 since there are 10 possible scores.
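To make the target concrete, here is a minimal sketch of turning one row of the sample data into a normalized 10-bin distribution. The helper name `parse_row` is hypothetical, and the sketch assumes every row has exactly one comma and at least one vote.

```python
# Sample rows copied from the data above: image name, then 10 vote counts.
rows = [
    "57284, 4 6 27 54 70 49 23 10 3 3",
    "57225, 3 10 40 62 67 38 19 10 6 3",
]

def parse_row(row):
    """Split 'name, v1 v2 ... v10' into (name, normalized distribution)."""
    name, votes = row.split(",")
    counts = [int(v) for v in votes.split()]
    total = sum(counts)
    # Divide each count by the total so the values sum to 1.
    return name.strip(), [c / total for c in counts]

name, dist = parse_row(rows[0])
```

Each image then has a single target vector of 10 floats, which is what the model's histogram output would be compared against.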
My initial approach was to use label_from_df, like so -
data = (ImageList.from_csv('./', 'labels_sample.csv', folder='data', suffix='.jpg')
        .split_by_rand_pct()
        .label_from_df(label_delim=' ')
        .transform(tfms, size=224)
        .databunch(bs=8))
This doesn’t work since the labels returned in each case are the individual vote counts, treated as class names, rather than a single 10-value distribution.
The other approach I tried is label_from_func, like so -
data = (ImageList.from_csv('./', 'labels_sample.csv', folder='data', suffix='.jpg')
        .split_by_rand_pct()
        .label_from_func(func)
        .transform(tfms, size=224)
        .databunch(bs=8))
Where func returns the image_name. This doesn’t work since each label is unique to its image, so the labels in the validation set do not show up in the train set and the validation data ends up empty.
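The empty validation set can be illustrated with a toy model of category encoding. This is a simplified sketch of what I assume happens, not fastai's actual code, and the helper names `build_classes` and `encode` are hypothetical.

```python
def build_classes(train_labels):
    """Classes are taken from the training labels only (simplified model)."""
    return sorted(set(train_labels))

def encode(labels, classes, filter_missing=True):
    """Map labels to class indices; unknown labels become None and may be dropped."""
    index = {c: i for i, c in enumerate(classes)}
    encoded = [index.get(label) for label in labels]
    if filter_missing:
        # Items whose label was never seen in training are removed.
        encoded = [e for e in encoded if e is not None]
    return encoded

train_labels = ["57284", "57225", "57217"]  # one unique label per image
valid_labels = ["57252", "57211"]           # never seen in training

classes = build_classes(train_labels)
valid_encoded = encode(valid_labels, classes)
```

Because every image name is its own class, no validation label can match a training class, so every validation item is dropped in this model.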
I then overrode filter_missing_y, like so -
class NimaLabelList(CategoryList):
    def __init__(self, items:Iterator, classes:Collection=None, label_delim:str=None, **kwargs):
        super().__init__(items, classes=classes, **kwargs)
        self.filter_missing_y = False

data = (ImageList.from_csv('./', 'labels_sample.csv', folder='data', suffix='.jpg')
        .split_by_rand_pct()
        .label_from_func(func, label_cls=NimaLabelList)
        .transform(tfms, size=224)
        .databunch(bs=8))
In this case, I don’t get empty validation data, but the validation labels are all 0, which is consistent since these labels are not part of the training data.
Finally, I tried this -
data = (ImageList.from_csv('./', 'labels_sample.csv', folder='data', suffix='.jpg')
        .split_by_rand_pct()
        .label_const(const=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
        .transform(tfms, size=224)
        .databunch(bs=8))
This way I have all the training and validation data.
With the last approach, I need to figure out a way to look up the index of each item to get its score distribution, which is required to compute the loss function when I use the dataloader.
x,y = next(iter(data.train_dl))
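For reference, here is a rough sketch of the lookup and the loss I have in mind, in plain Python. The `targets` dictionary keyed by image name is hypothetical, and `emd_loss` is my reading of the earth mover's distance loss from the linked paper with r=2. How to recover the image names for a batch from the dataloader is exactly the part I have not solved, so it is not shown, and in training this would be done on tensors rather than lists.

```python
def cdf(dist):
    """Running sum of a distribution over ordered score buckets."""
    out, acc = [], 0.0
    for p in dist:
        acc += p
        out.append(acc)
    return out

def emd_loss(target, pred, r=2):
    """Distance between the CDFs of two distributions over the same buckets."""
    diffs = [abs(a - b) ** r for a, b in zip(cdf(target), cdf(pred))]
    return (sum(diffs) / len(diffs)) ** (1 / r)

# Hypothetical lookup table: image name -> raw vote counts from the CSV.
votes = {"57284": [4, 6, 27, 54, 70, 49, 23, 10, 3, 3]}
# Normalize the counts so each target sums to 1.
targets = {k: [c / sum(v) for c in v] for k, v in votes.items()}

uniform = [0.1] * 10  # stand-in for a model prediction
loss_same = emd_loss(targets["57284"], targets["57284"])
loss_uniform = emd_loss(targets["57284"], uniform)
```

A prediction identical to the target gives a loss of zero, and any other prediction gives a positive value.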
Another approach is to use ItemList instead of ImageList and see if I can work around the problem. But before going down that route, I would like to understand if there are other approaches or if I am missing something obvious.
Thanks in advance.