Stratified batch sampler

hey guys, I have tried to come up with a stratified batch sampler which would maintain the ratio of distinct classes at 1:1:1 … by randomly selecting an equal number of samples from each class to build a batch.
Here is the code. Please let me know if there are any errors.

import random
import numpy as np
from torch.utils.data import Sampler, BatchSampler
from fastai.basics import *   # fastai v1: Learner, LearnerCallback, arange_of


class class_balancer(Sampler):
    def __init__(self, df, bs, trn_idx, ratio):
        self.ratio = np.array(ratio)
        # number of samples to draw from each class for one batch
        # (hardcoded for 5 classes for now)
        self.counts = [int(ratio[0]*(bs//self.ratio.sum())), int(ratio[1]*(bs//self.ratio.sum())),
                       int(ratio[2]*(bs//self.ratio.sum())), int(ratio[3]*(bs//self.ratio.sum())),
                       int(ratio[4]*(bs//self.ratio.sum()))]
        self.df = df.copy()
        self.bs = bs
        self.trn_idx = trn_idx

    def __iter__(self):
        # randomly pick `c` training indices from each class `i` (class label in df.diagnosis)
        sample = [random.sample(self.df.loc[(self.df.diagnosis == i)
                                            & (self.df.index.isin(self.trn_idx))].index.tolist(), c)
                  for i, c in enumerate(self.counts)]
        sample = np.hstack(sample).tolist()
        # top up with random training indices if the per-class counts don't fill the batch
        if len(sample) < self.bs:
            sample = sample + random.sample(list(self.trn_idx), self.bs - len(sample))
        return iter(sample)

    def __len__(self):
        return len(self.trn_idx)
class OverSamplingCallback1(LearnerCallback):
    def __init__(self, learn:Learner, df, val_id, weights=None):
        super().__init__(learn)
        labels = self.learn.data.train_dl.dataset.y.items.astype(int)
        _, self.counts = np.unique(labels, return_counts=True)
        self.df = df.copy()
        # training indices = all rows of df except the validation ids
        self.train_idx = np.setdiff1d(arange_of(df), val_id)
        self.ratio = weights
        self.bs = learn.data.train_dl.batch_size

    def on_batch_begin(self, **kwargs):
        # build the stratified sampler and swap it into the training dataloader
        sample = class_balancer(df=self.df, bs=self.bs, trn_idx=self.train_idx, ratio=self.ratio)
        self.learn.data.train_dl.dl.batch_sampler = BatchSampler(sample, self.learn.data.train_dl.batch_size, False)
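
For anyone following along, this is roughly how I imagine the callback being attached; it is just a sketch assuming a fastai v1 Learner (`learn`), a training dataframe (`train_df`) and validation ids (`val_idx`) already exist — those names are placeholders, not from the post.

# Rough usage sketch; `learn`, `train_df` and `val_idx` are placeholders.
cb = OverSamplingCallback1(learn, df=train_df, val_id=val_idx, weights=[1, 1, 1, 1, 1])
learn.fit_one_cycle(5, 1e-3, callbacks=[cb])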

What is wrong with using the oversampling callback which already does this?

@ilovescience I believe the difference here is that instead of purely oversampling the minority classes, it balances the classes by randomly both undersampling and oversampling. @champs.jaideep correct me if I’m wrong.

But isn’t the outcome the same?

The outcome is almost the same, yes. I saw minimal difference between “balance” sampling (stratified) vs oversampling. You can see my notebook here:

Is balance/stratified sampling your mix of undersampling and oversampling?

Correct, with the target count per class being the mean. Was that incorrect to do?

Not necessarily. My previous, incorrect version of OverSamplingCallback did something similar before I corrected it. I had originally taken the size of the most imbalanced class and set that as the total size. This led to some classes being oversampled and some being undersampled. But those results were poor, and when I fixed the error the results were much better.

I am a little surprised that in your experiments oversampling did not improve your results. I think the key thing is that there are no good augmentations for tabular data like there are for images, so with oversampling on tabular data the model has a higher chance of simply memorizing the duplicated data points, whereas with images the augmentations offset that effect.

@ilovescience interesting! How did you go about fixing this error?

And I do notice improved results in a particular research project I am doing, as there are some relatively heavy class imbalances going on (and with multiple classes too).

The weighted sampler takes in the total number of samples, which I incorrectly set to the length of the original dataset.

If you look at the callback code, I now have a variable self.total_len_oversample which is the correct number of samples to pass into the WeightedRandomSampler in order to do correct oversampling:

self.total_len_oversample = int(self.learn.data.c*np.max(self.label_counts))
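
To spell out the idea, here is a rough sketch of that oversampling setup; it paraphrases the concept rather than quoting the exact fastai source, and the toy label array and variable names are illustrative only.

import numpy as np
from torch.utils.data import WeightedRandomSampler

# Rough sketch of the corrected oversampling idea (not the exact fastai code).
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])   # toy label array
_, label_counts = np.unique(labels, return_counts=True)
num_classes = len(label_counts)

# each sample's weight is inversely proportional to its class frequency
weights = 1.0 / label_counts[labels]

# total samples drawn per epoch: num_classes * size of the largest class,
# so every class appears about as often as the majority class
total_len_oversample = int(num_classes * np.max(label_counts))

sampler = WeightedRandomSampler(weights.tolist(), total_len_oversample)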

Interesting… I see now. Regardless, we can come to the same conclusion that holding out data (partial downsampling) did not seem to help at all. (If anyone has seen it help, chime in.)


Yes I agree.

From what I understand, though, this sampler does that. Correct me if I am wrong.

Also, it is hardcoded for a 5-class problem.

@champs.jaideep

It’s pretty clear to me that you are trying to use this for this competition. I will say that I unfortunately did not have success with oversampling for this competition. But feel free to try and let me know if you see anything different.

It does. Though it’s not hard-coded for a 5-class problem. I use the same code for my research where I have 10–72 different classes. (Unless I looked at the wrong function!)

This line in the code for the sampler seems to suggest hard-coding for a 5-class problem:

self.counts = [int(ratio[0]*(bs//self.ratio.sum())), int(ratio[1]*(bs//self.ratio.sum())),
               int(ratio[2]*(bs//self.ratio.sum())), int(ratio[3]*(bs//self.ratio.sum())),
               int(ratio[4]*(bs//self.ratio.sum()))]

Again, it’s probably because the author is using it for a Kaggle competition, and it can easily be modified for generality.
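
For instance, the hardcoded list could be generalized along these lines (just a sketch; `per_class_counts` is a hypothetical helper, not from the original code):

import numpy as np

# Generalized per-class counts for an arbitrary number of classes.
def per_class_counts(bs, ratio):
    ratio = np.array(ratio)
    return [int(r * (bs // ratio.sum())) for r in ratio]

per_class_counts(160, [4, 1, 1, 1, 1])   # -> [80, 20, 20, 20, 20]
per_class_counts(64,  [1, 1, 1, 1])      # -> [16, 16, 16, 16]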


Yes. So say I have classes with the following counts:

  1. 1000
  2. 100
  3. 200
  4. 50

This sampler intends, by random selection, to maintain a set number of samples from each class in a batch, say 80, 20, 20, 20, 20.
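
To make that concrete, here is a toy illustration of the per-class draw for a single batch; the label array and counts below are made up for demonstration, not taken from the actual competition data.

import random
import numpy as np

# Toy illustration of the per-class draw the sampler performs for one batch.
labels = np.array([0]*1000 + [1]*100 + [2]*200 + [3]*50)
counts = [20, 20, 20, 20]           # how many indices to draw from each class

batch = [random.sample(np.where(labels == i)[0].tolist(), c)
         for i, c in enumerate(counts)]
batch = np.hstack(batch).tolist()   # 80 indices, 20 per class
random.shuffle(batch)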

Yes, it’s hardcoded…
for now. I will try to generalize it to any number of classes.

Yes, oversampling did not help much, so I thought of trying it this way.
However, I am facing some issues running it: the event does not seem to get called properly. I printed a statement in __iter__, but it does not run at the start of training, and hence not at batch begin; it only gets called at the end of the first epoch. I am not sure I understand correctly how the events get called during training.

Here is the working version. The first one had some bugs that I rectified…

class class_balancer(Sampler):
    def __init__(self, arr, bs, ratio, bn):
        self.ratio = np.array(ratio)
        # number of samples to draw from each class for one batch (still hardcoded for 5 classes)
        self.counts = [int(ratio[0]*(bs//self.ratio.sum())), int(ratio[1]*(bs//self.ratio.sum())),
                       int(ratio[2]*(bs//self.ratio.sum())), int(ratio[3]*(bs//self.ratio.sum())),
                       int(ratio[4]*(bs//self.ratio.sum()))]
        self.bs = bs
        self.arr = arr           # array of training labels, one per sample
        self.batch_num = bn      # number of batches to generate per epoch

    def __iter__(self):
        print('y1')  # debug print to verify when the sampler actually gets iterated
        flat_batch = []
        for _ in range(self.batch_num):
            # draw `c` random positions for each class `i`
            sample = [random.sample(np.where(self.arr == i)[0].tolist(), c)
                      for i, c in enumerate(self.counts)]
            sample = np.hstack(sample).tolist()
            # top up with random indices if the per-class counts don't fill the batch
            if len(sample) < self.bs:
                sample = sample + random.sample(range(len(self.arr)), self.bs - len(sample))
            random.shuffle(sample)
            flat_batch.append(np.hstack(sample))
        flat_batch = np.concatenate(flat_batch).tolist()
        # trim so the sampler yields exactly one epoch's worth of indices
        flat_batch = flat_batch[0:self.arr.shape[0]]
        return iter(flat_batch)

    def __len__(self):
        return len(self.arr.tolist())

class OverSamplingCallback1(LearnerCallback):
    def __init__(self, learn:Learner, weights=None, bn=None):
        super().__init__(learn)
        labels = self.learn.data.train_dl.dataset.y.items.astype(int)
        self.labels_array = np.array(list(labels))
        _, self.counts = np.unique(labels, return_counts=True)
        self.bn = bn             # number of batches per epoch
        self.ratio = weights     # per-class ratio, e.g. [1, 1, 1, 1, 1]
        self.bs = self.learn.data.train_dl.batch_size
        self.learn = learn

    def on_epoch_begin(self, **kwargs):
        # rebuild the stratified sampler at the start of every epoch and
        # swap it into the training dataloader
        self.sample = class_balancer(arr=self.labels_array, bs=self.bs, ratio=self.ratio, bn=self.bn)
        self.learn.data.train_dl.dl.batch_sampler = BatchSampler(self.sample, self.learn.data.train_dl.batch_size, False)
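
If anyone wants to try it, attaching the revised callback would look roughly like this; it is only a sketch, and the learner, weights, epochs and learning rate are placeholders rather than the values I actually used.

# Rough usage sketch; `learn`, the weights, epochs and learning rate are placeholders.
labels = learn.data.train_dl.dataset.y.items.astype(int)
bn = len(labels) // learn.data.train_dl.batch_size   # batches per epoch

cb = OverSamplingCallback1(learn, weights=[1, 1, 1, 1, 1], bn=bn)
learn.fit_one_cycle(5, 1e-3, callbacks=[cb])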