Correcting Class imbalance for NLP


(Leon Dummer) #1

Hello everyone,

So I have a dataset with a decent class imbalance across 4 classes (18%, 18%, 14%, 49%), where the first 3 are concrete emotions and the fourth is an "other" class. Usually I would go ahead and put the appropriate weights into the loss function to fix the imbalance, but I don't see any way to do so with fast.ai.
Am I overlooking something or does it even make sense in that case to use weights to fix the imbalance?


(Even Oldridge) #2

You can do this directly with a custom loss function that wraps cross entropy. It wouldn't be a bad feature to add, as it's a common task, but it's a little tricky: you generally want to train on the weighted loss but evaluate your validation/test sets on the unbalanced data, so you'd want to add plain cross entropy as a metric. The other option is over- or undersampling your data, which would mean a class-aware dataloader.
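The over/undersampling route can be sketched in plain Python: compute one sampling weight per example, inversely proportional to its class count. This is the form that PyTorch's `torch.utils.data.WeightedRandomSampler` expects; the label list below is hypothetical.

```python
from collections import Counter

def sample_weights(labels):
    """One weight per example, inversely proportional to its class count,
    so a weighted sampler draws each class roughly equally often."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Hypothetical imbalanced label list: class 3 ("other") dominates.
labels = [0, 1, 2, 3, 3, 3, 3]
weights = sample_weights(labels)
# Each minority example gets weight 1.0, each majority example 0.25,
# so every class carries the same total sampling mass.
```

These per-example weights could then be passed to `WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)` to build a roughly class-balanced dataloader.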


(Kyle Nesgood) #3

@EinAeffchen - if you do this, would you mind posting your code on how you solve it? I’d be interested - my programming fu isn’t up to snuff yet and I’d love to see an example.


(Leon Dummer) #4

@knesgood if I get it done, I’ll post the code here. But for now I have no idea how to do it either.


(Leon Dummer) #5

So it turns out to be way easier than expected. With a little digging through the code I saw that RNN_Learner overrides its superclass Learner's _get_crit function to return PyTorch's F.cross_entropy. That function already accepts weights, so you can just pass your calculated weights like this:

loss_weights = torch.FloatTensor(trn_weights).cuda()  # move per-class weights to the GPU
learn.crit = partial(F.cross_entropy, weight=loss_weights)

I calculated my weights with this code:

trn_labelcounts = df_trn.groupby(["labels"]).size()  # examples per class (train)
val_labelcounts = df_val.groupby(["labels"]).size()  # examples per class (validation)
trn_label_sum = len(df_trn["labels"])  # total number of training examples
val_label_sum = len(df_val["labels"])
trn_weights = [count/trn_label_sum for count in trn_labelcounts]  # class frequencies
val_weights = [count/val_label_sum for count in val_labelcounts]
trn_weights, val_weights

To check that your weights were parsed correctly, you can simply print the criterion:

print(learn.crit)

which should return something like:

functools.partial(<function cross_entropy at 0x00000282813B3268>, weight=tensor([0.1815, 0.1816, 0.1414, 0.4956], device='cuda:0'))

If you have any trouble don’t hesitate to write me @knesgood
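For intuition about what the `weight` argument does: as I understand PyTorch's default 'mean' reduction, weighted cross entropy scales each example's negative log-probability by its class weight and normalizes by the sum of the weights that were actually used. A minimal pure-Python sketch with hypothetical probabilities (no PyTorch required):

```python
import math

def weighted_ce(probs, targets, weight):
    """Weighted cross entropy over already-softmaxed probabilities,
    mirroring F.cross_entropy's 'mean' reduction:
    sum(w[y] * -log p[y]) / sum(w[y])."""
    num = sum(weight[y] * -math.log(p[y]) for p, y in zip(probs, targets))
    den = sum(weight[y] for y in targets)
    return num / den

# Two examples: one for class 0, one for class 1.
probs = [[0.7, 0.3], [0.2, 0.8]]
targets = [0, 1]
# Upweighting class 0 makes its mistakes count for more of the loss.
loss_unweighted = weighted_ce(probs, targets, [1.0, 1.0])
loss_weighted = weighted_ce(probs, targets, [2.0, 1.0])
```

Since the class-0 example is predicted less confidently here, increasing its class weight pushes the overall loss up, which is exactly the extra pressure you want on the underrepresented classes.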


(Piotr Czapla) #6

You want to penalize the frequency, not reward it.

Replace this:

trn_weights = [count/trn_label_sum for count in trn_labelcounts]
val_weights = [count/val_label_sum for count in val_labelcounts]

with this:

trn_weights = [1 - count/trn_label_sum for count in trn_labelcounts]
val_weights = [1 - count/val_label_sum for count in val_labelcounts]
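A quick worked comparison of the two formulas, using hypothetical per-class counts that roughly match the 18/18/14/49 split from above:

```python
counts = [180, 180, 140, 490]  # hypothetical per-class counts, ~18/18/14/49 split
total = sum(counts)

reward = [c / total for c in counts]        # original: majority class gets the LARGEST weight
penalize = [1 - c / total for c in counts]  # corrected: majority class gets the SMALLEST weight
```

With the original formula the 49% class would be weighted highest, amplifying the imbalance; with the corrected one it is weighted lowest, which is the intended penalty.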

(Hugues) #7

@EinAeffchen
I’ve re-used your code like this below in the lesson 1 notebook. Is this the way to do it? I don’t really improve my test set accuracy doing so. I tried swapping my weights to [0.99, 0.01] but it gets worse, so I think I have them in the right order. Not sure what is going on.

arch = resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
loss_weights = torch.FloatTensor([0.01, 0.99]).cuda()  # weight 0.01 for class 0, 0.99 for class 1
learn.crit = partial(F.cross_entropy, weight=loss_weights)
learn.fit(0.01, 50)

(Leon Dummer) #8

@Hugues1965 your weights need to correct the imbalance in your dataset. For example, if you have 100 dog images and 50 cat images, you want your weights to be something like [1, 2] (or equivalently [0.5, 1]), so the rarer class counts for more. You want the loss to compensate for the inequality in the training data.

My current weight calculation looks like this:

trn_labelcounts = df_trn.groupby(["labels"]).size()  # examples per class (train)
val_labelcounts = df_val.groupby(["labels"]).size()  # examples per class (validation)
trn_label_sum = len(df_trn["labels"])  # total number of examples
val_label_sum = len(df_val["labels"])
trn_weights = [count/trn_label_sum for count in trn_labelcounts]  # class frequencies
val_weights = [count/val_label_sum for count in val_labelcounts]
trn_weights, val_weights
rtd_val_weights = [max(val_weights)/value for value in val_weights]  # most frequent class -> 1, rarer -> larger
rtd_val_weights

which returns a vector like this:
[2.5775862068965516, 2.794392523364486, 3.682266009852217, 1.0]

As described above, my classes have a distribution of 18%, 18%, 14%, 49%. So I basically take the frequency of the most common class and divide it by the frequency of each class, which gives the most common class a weight of 1 and every rarer class a proportionally larger weight.
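The same ratio calculation as a standalone sketch, again with hypothetical counts approximating the split above:

```python
counts = [180, 180, 140, 490]  # hypothetical per-class counts, ~18/18/14/49 split
freqs = [c / sum(counts) for c in counts]

# Most frequent class -> weight 1.0; rarer classes -> proportionally larger weights.
ratio_weights = [max(freqs) / f for f in freqs]
```

For these counts the rarest class (14%) ends up with the largest weight, 490/140 = 3.5, and the majority class with exactly 1.0, matching the shape of the vector in the post above.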


(Hugues) #9

thanks @EinAeffchen
I’ve got only 2 classes with a 99% / 1% split, so my weights are ok then.
Maybe there is just no good signal in my data, I guess.