Is there a ‘beginner/intermediate’ friendly way to do oversampling in fastai2?
I’ve seen this post on this forum: [Oversampling Callback]. But I think it’s only for fastai1. Getting this to work on fastai2 would be very useful, since I’m interested in medical imaging and most medical image datasets are highly imbalanced.
The last time I tried to use the weighted dataloader I could not get it working properly, so I used a different methodology: exporting pandas DataFrames, performing the oversampling there, then converting it back. It’s easier than it sounds, and the full example is here
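For anyone curious, here is a minimal sketch of what the DataFrame side of that can look like (this is not the exact code from the linked notebook; train_df and the 'label' column are placeholders): upsample every minority class with replacement until it matches the majority class, then shuffle.

import pandas as pd

def oversample_df(df, label_col='label', seed=42):
    # Size of the largest class; every other class is upsampled to match it
    max_count = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        # Sample with replacement so small classes can grow beyond their original size
        parts.append(group.sample(max_count, replace=True, random_state=seed))
    # Shuffle so the duplicated rows are not grouped by class
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# balanced_df = oversample_df(train_df, label_col='label')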
@muellerzr I don’t think I’m at that level yet, but it might be a nice project for the future.
Is there an advantage of a weighted DataLoader over oversampling?
@ilovescience Thanks, this seems like a nice solution. I’m going to try it.
I am not sure about this, but I think using the weighted dataloaders is more memory efficient than a classic oversampler that replicates data: the dataloader simply passes items from the minority class to the model more frequently, without any explicit replication… (I should double-check this though)
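For what it’s worth, plain PyTorch expresses the same idea with torch.utils.data.WeightedRandomSampler: it draws indices with replacement according to per-sample weights, so minority items show up more often in batches without the dataset ever being copied. A minimal sketch with toy data (shapes and counts are made up for illustration):

import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Toy dataset: 90 samples of class 0, 10 samples of class 1
labels = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])
data = torch.randn(100, 8)
ds = TensorDataset(data, labels)

# Per-sample weight = 1 / count of that sample's class
class_counts = torch.bincount(labels).float()
sample_wgts = 1.0 / class_counts[labels]

# The sampler re-draws indices with replacement; no data is duplicated in memory
sampler = WeightedRandomSampler(sample_wgts, num_samples=len(ds), replacement=True)
dl = DataLoader(ds, batch_size=20, sampler=sampler)

xb, yb = next(iter(dl))
print(yb.float().mean())  # roughly 0.5, i.e. batches are ~balanced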
Anyway, what I found to work better in my problems with imbalanced data is simply to assign weights in the loss function, so that errors on the minority class are penalized more heavily.
Wanted to double check: how did you calculate the weights?
Is it the same as how you would calculate the weights for a weighted dataloader, i.e. 1/number_of_samples_per_class, or were there any tweaks?
I have very few samples of certain classes, so my ratio is 2000 to 1, and I was not getting great results with weighted loss functions.
I usually apply the rule 1/n_samples_in_class in the weighted loss function and it works well in terms of Recall and F1, which are the metrics I am interested in for this type of problem.
Here is an example for folks who want to apply Weighted Cross-Entropy as the loss function for an imbalanced dataset.
Weighted Cross-Entropy
from collections import Counter
from fastcore.foundation import L

# Get weights based on the class distribution in the training data
def get_weights(dls):
    # 0th index would provide the vocab from text
    # 1st index would provide the vocab from classes
    classes = dls.vocab[1]
    # Map each training item to its label id, then to the actual label via the vocab
    train_lbls = L(map(lambda x: classes[x[1]], dls.train_ds))
    label_counter = Counter(train_lbls)
    n_most_common_class = max(label_counter.values())
    print(f'Occurrences of the most common class: {n_most_common_class}')
    # Source: https://discuss.pytorch.org/t/what-is-the-weight-values-mean-in-torch-nn-crossentropyloss/11455/9
    # Keep the weights in vocab order so that weight[i] matches class index i in the loss
    weights = [n_most_common_class / label_counter[c] for c in classes if label_counter[c] > 0]
    return weights
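In case it helps others, here is a minimal sketch of how the returned weights could be passed to the loss (assuming a standard single-label classification dls; as far as I know CrossEntropyLossFlat just forwards weight to torch.nn.CrossEntropyLoss):

import torch
from fastai.losses import CrossEntropyLossFlat

# `dls` are the DataLoaders used above; the weight tensor must be on the same device as the model
class_weights = torch.FloatTensor(get_weights(dls)).to(dls.device)
loss_func = CrossEntropyLossFlat(weight=class_weights)
# then, for example: learn = Learner(dls, model, loss_func=loss_func)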
Just to help people who are trying to get some kind of OverSampler, here I post the “hack” I had to do to implement it. This is a first iteration, so it could certainly be made more robust; feel free to suggest improvements.
To give some context: here the splitter splits by index depending on the Dataset column. Also note that the label_df DataFrame is sorted so that the training rows come first, which is why I use [:len(train_df)] on the weights.
label_db = DataBlock(
    blocks=(ImageBlock(cls=PILImageBW), MultiCategoryBlock),
    get_x=ColReader('Original_Filename', pref=raw_preprocess_folder+'/', suff='.png'),
    get_y=ColReader('Target'),
    splitter=TestColSplitter(col='Dataset'),
    item_tfms=item_tfms,
    batch_tfms=label_transform,
)
label_dl = label_db.dataloaders(label_df, bs=BATCH_SIZE, num_workers=0, shuffle_train=True, drop_last=True)
# Calculate sample weights to balance the DataLoader
from collections import Counter
from fastai.callback.data import WeightedDL

count = Counter(label_dl.items['Target'])
class_weights = {}
for c in count:
    class_weights[c] = 1 / count[c]
# label_df is sorted with the training rows first, hence the [:len(train_df)] slice
wgts = label_dl.items['Target'].map(class_weights).values[:len(train_df)]

weighted_dl = label_db.dataloaders(label_df, bs=BATCH_SIZE, num_workers=0, shuffle_train=True,
                                   drop_last=True, dl_type=WeightedDL, wgts=wgts)
label_dl.train = weighted_dl.train
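If you want to sanity-check the result, a quick (hedged) way is to pull one training batch and count how often each class appears; with weights of 1/count the classes should show up at roughly equal rates. Since this example uses MultiCategoryBlock, the targets are one-hot tensors, so summing over the batch dimension gives per-class counts:

# Grab one training batch and count class occurrences
xb, yb = label_dl.train.one_batch()
print(dict(zip(label_dl.vocab, yb.sum(dim=0).tolist())))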
How can I make use of this to perform balanced sampling? Say the class counts are
C1 - 5k
C2 - 2k
C3 - 1.5k
I want each batch to have a class ratio of more or less 1:1:1,
so out of C1 I may want to select not all samples, just enough to match the counts of the less dominant classes.
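One way to read the recipe above: with WeightedDL and a per-sample weight of 1/count of its class, every class carries the same total weight, so each draw is about equally likely to come from C1, C2 or C3 and batches end up roughly 1:1:1 without discarding the extra C1 samples. A minimal sketch with the counts from the question (train_df and the 'label' column are placeholders):

from collections import Counter

# counts like {'C1': 5000, 'C2': 2000, 'C3': 1500}
counts = Counter(train_df['label'])
# per-sample weight = 1 / count of its class: each class then carries equal total weight,
# so a weighted draw picks C1, C2 and C3 with roughly equal probability -> ~1:1:1 batches
wgts = train_df['label'].map(lambda c: 1 / counts[c]).values
# plug wgts into the WeightedDL recipe above (dl_type=WeightedDL, wgts=wgts)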