Balancing multi-label data in fastai

austinmw · July 18, 2019, 3:18pm

Hey, fastai users, what techniques are you using to balance multi-label data? So far I’ve been manually under-sampling the majority classes before instantiating my ImageDataBunch, but I’m curious whether there is a more automated or preferable way.

Are there fastai utilities for any of the following?

minority over-sampling
majority under-sampling
multilabel class weighting in loss function
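
For the third item, here is a minimal sketch of what per-class weighting could look like, written in plain NumPy rather than as a fastai utility. The label matrix `y` and both helper functions are hypothetical and only for illustration. The weight for each class is the ratio of negatives to positives, which is the same quantity one would typically pass as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`.

```python
import numpy as np

# Hypothetical multi-hot training labels: 6 samples, 3 classes.
y = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
], dtype=np.float64)

def positive_class_weights(y):
    """Per-class weight = #negatives / #positives, so rare classes weigh more."""
    pos = y.sum(axis=0)
    neg = y.shape[0] - pos
    return neg / np.clip(pos, 1, None)  # clip avoids division by zero

def weighted_bce_with_logits(logits, targets, pos_weight):
    """Mean binary cross-entropy with positive terms scaled by pos_weight."""
    log_p = -np.logaddexp(0, -logits)     # log(sigmoid(logits))
    log_not_p = -np.logaddexp(0, logits)  # log(1 - sigmoid(logits))
    loss = -(pos_weight * targets * log_p + (1 - targets) * log_not_p)
    return loss.mean()

pos_weight = positive_class_weights(y)
logits = np.zeros_like(y)  # stand-in for model outputs
loss = weighted_bce_with_logits(logits, y, pos_weight)
print(pos_weight, loss)
```

With these weights, the positive and negative examples of each class contribute equally to the loss when the model is uninformative. In PyTorch the equivalent would presumably be `torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(pos_weight))`, assigned as the learner's loss function.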

Any suggestions are greatly appreciated!

muellerzr · July 18, 2019, 3:23pm

@ilovescience recently wrote an oversampling callback; see the thread “Oversampling Callback”.

austinmw · July 18, 2019, 3:47pm

@muellerzr Thanks. I just tried it, and it looks like it doesn’t currently support multi-label data:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/callbacks/oversampling.py in __init__(self, learn, weights)
     15         _, counts = np.unique(self.labels,return_counts=True)
     16         self.weights = (weights if weights is not None else
---> 17                         torch.DoubleTensor((1/counts)[self.labels]))
     18         self.label_counts = np.bincount([self.learn.data.train_dl.dataset.y[i].data for i in range(len(self.learn.data.train_dl.dataset))])
     19         self.total_len_oversample = int(self.learn.data.c*np.max(self.label_counts))

IndexError: arrays used as indices must be of integer (or boolean) type

But good to know about this callback for non-multi-label problems!
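
The IndexError in the traceback most likely arises because multi-label targets are multi-hot float arrays, and `(1/counts)[self.labels]` tries to use them as indices. The sketch below reproduces that failure in NumPy and shows one possible workaround: give each sample the largest inverse frequency among its positive labels and sample with those weights. The label matrix is hypothetical, the float storage of labels in fastai is an assumption, and the max-of-inverse-frequency rule is a heuristic, not what the callback does.

```python
import numpy as np

# Hypothetical multi-hot labels, assumed to be stored as floats.
labels = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
], dtype=np.float32)

counts = labels.sum(axis=0)  # positives per class

# Indexing with a float array fails, as in the traceback above.
try:
    (1 / counts)[labels]
    failed = False
except IndexError:
    failed = True

# Workaround sketch: weight each sample by its rarest positive label.
y = labels.astype(np.float64)
inv_freq = 1.0 / np.clip(y.sum(axis=0), 1, None)
weights = (y * inv_freq).max(axis=1)  # samples with no labels get weight 0
probs = weights / weights.sum()

# Draw an oversampled set of training indices with replacement.
rng = np.random.default_rng(0)
idx = rng.choice(len(y), size=len(y), replace=True, p=probs)
print(failed, weights, idx)
```

The same `weights` array could be handed to `torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)`. Note that oversampling a sample with a rare label also duplicates any common labels it carries, so multi-label data can only be balanced approximately this way.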