So I have a dataset with a noticeable class imbalance across 4 classes (18%, 18%, 14%, 49%), where the first 3 are concrete emotions and the fourth is a catch-all class for “other”. Usually I would just put the appropriate weights in the loss function to fix the imbalance, but I don’t see any way to do so with fast.ai.
Am I overlooking something or does it even make sense in that case to use weights to fix the imbalance?
You can do this directly with a custom loss function that wraps cross entropy. It wouldn’t be a bad feature to add, as it’s a common task, but it’s a little tricky: you generally want to train on the weighted loss but evaluate your validation/test data on the unbalanced set, so you’d want to add plain (unweighted) CE as a metric. The other option is over- or undersampling your data, which would mean a class-aware dataloader.
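To make the "train on weighted loss, report plain CE" idea concrete, here is a minimal sketch in plain Python. It is not fast.ai's or PyTorch's API: `weighted_cross_entropy` is a hypothetical helper, and the probabilities and weights are made up. It normalises by the total weight in the batch, which is the same weighted-mean convention PyTorch's `F.cross_entropy` uses when given a `weight` argument.

```python
import math

def weighted_cross_entropy(probs, targets, class_weights=None):
    """Mean cross entropy over a batch, optionally weighted per class.

    probs: list of per-example probability lists (each sums to 1)
    targets: list of integer class indices
    class_weights: one weight per class, or None for plain CE
    """
    if class_weights is None:
        class_weights = [1.0] * len(probs[0])
    total_loss = 0.0
    total_weight = 0.0
    for p, y in zip(probs, targets):
        w = class_weights[y]
        total_loss += -w * math.log(p[y])  # scale this example's loss
        total_weight += w
    return total_loss / total_weight  # weighted mean over the batch

# Made-up batch: two examples of the frequent class 1, one of the rare class 0.
probs = [[0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
targets = [1, 1, 0]

train_loss = weighted_cross_entropy(probs, targets, class_weights=[2.0, 1.0])
val_metric = weighted_cross_entropy(probs, targets)  # plain CE, used as a metric
```

Here the rare-class example is the one the model is least sure about, so the weighted loss comes out higher than the plain CE. That is the point of the weighting: mistakes on the rare class cost more during training, while the metric still reflects the real distribution.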
@EinAeffchen - if you do this, would you mind posting your code on how you solve it? I’d be interested - my programming fu isn’t up to snuff yet and I’d love to see an example.
So it seems to be actually way easier than expected. With a bit of digging through the code I saw that the RNN_Learner overrides its superclass “Learner”'s _get_crit function to return the PyTorch F.cross_entropy function. That already accepts weights, so you can just pass your calculated weights as
@EinAeffchen
I’ve re-used your code like this below in the lesson1 notebook. Is this the way to do it? I ask because I don’t really improve my test set accuracy doing so. I tried swapping my weights to [0.99,0.01] but it gets worse, so I think I got them in the right order. Not sure what is going on.
@Hugues1965 your weights need to correct the imbalance in your dataset. For example, if you have 100 dog images and 50 cat images, you want your weights to be like [1,2] or [0.5,1], so that the rarer class counts for more. The weights should offset the inequality in the training data.
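A quick way to sanity-check a set of weights is to multiply each class count by its weight and see whether the totals come out equal. A small sketch using the dog/cat numbers from the example above (the variable names are just for illustration):

```python
# Class counts from the example: 100 dog images, 50 cat images.
counts = {"dog": 100, "cat": 50}

# Weight each class by (largest class count / its own count).
largest = max(counts.values())
weights = {label: largest / n for label, n in counts.items()}

# Effective contribution of each class = count * weight.
effective = {label: counts[label] * weights[label] for label in counts}

print(weights)    # {'dog': 1.0, 'cat': 2.0}
print(effective)  # {'dog': 100.0, 'cat': 100.0}
```

Scaling all weights by the same factor, e.g. [0.5, 1] instead of [1, 2], keeps the same ratio between classes.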
My current weight calculation looks like this:
trn_labelcounts = df_trn.groupby(["labels"]).size()
val_labelcounts = df_val.groupby(["labels"]).size()
trn_label_sum = len(df_trn["labels"])
val_label_sum = len(df_val["labels"])
trn_weights = [count/trn_label_sum for count in trn_labelcounts]
val_weights = [count/val_label_sum for count in val_labelcounts]
trn_weights, val_weights
rtd_val_weights = [max(val_weights)/value for value in val_weights]
rtd_val_weights
which returns a vector like this: [2.5775862068965516, 2.794392523364486, 3.682266009852217, 1.0]
As described above, my classes had a distribution of 18%, 18%, 14%, 49%. So I basically take the frequency of the most common class and divide it by the frequency of each class to get that class's weight, which leaves the most common class with a weight of 1.0.
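The same calculation can be sketched without pandas, which also makes the class ordering explicit. This is only an illustration under assumptions: `class_weights_from_labels` is a hypothetical helper, the emotion names and counts are made up to match the rough 18/18/14/49 split, and it assumes the model indexes classes in sorted label order, so check that against your own data setup.

```python
from collections import Counter

def class_weights_from_labels(labels):
    """Return (classes, weights) with weight = max class count / class count.

    Classes are returned in sorted order; the weights list follows the
    same order, so it only lines up with the model if the model indexes
    its classes the same way (an assumption to verify).
    """
    counts = Counter(labels)
    classes = sorted(counts)
    largest = max(counts.values())
    weights = [largest / counts[c] for c in classes]
    return classes, weights

# Made-up labels roughly matching the 18% / 18% / 14% / 49% distribution.
labels = ["anger"] * 18 + ["joy"] * 18 + ["sadness"] * 14 + ["other"] * 49

classes, weights = class_weights_from_labels(labels)
for c, w in zip(classes, weights):
    print(c, round(w, 3))
```

If the weights end up in a different order than the model's class indices, the rare classes get the small weights and the imbalance gets worse rather than better, so it is worth printing the pairs as above before training.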