Hey guys, I have tried to come up with a stratified batch sampler that maintains the ratio of distinct classes at 1:1:1 … by randomly selecting an equal number of samples from each class to build each batch.
Here is the code. Please let me know if there are any errors.
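Roughly, the idea looks something like this (a minimal sketch in plain PyTorch, not my exact code; the class name `StratifiedBatchSampler` and its `labels`/`batch_size`/`num_batches` arguments are just illustrative):

```python
import numpy as np
from torch.utils.data import Sampler

class StratifiedBatchSampler(Sampler):
    """Yields batches containing an equal number of samples from each class."""
    def __init__(self, labels, batch_size, num_batches):
        self.labels = np.asarray(labels)
        self.classes = np.unique(self.labels)
        assert batch_size % len(self.classes) == 0, "batch_size must split evenly across classes"
        self.per_class = batch_size // len(self.classes)
        self.num_batches = num_batches
        # precompute the index pool for each class once
        self.class_indices = {c: np.where(self.labels == c)[0] for c in self.classes}

    def __iter__(self):
        for _ in range(self.num_batches):
            batch = []
            for c in self.classes:
                # sample with replacement so small classes can always fill their share
                batch.extend(np.random.choice(self.class_indices[c], self.per_class, replace=True))
            np.random.shuffle(batch)
            yield [int(i) for i in batch]

    def __len__(self):
        return self.num_batches
```

It would plug into a DataLoader through the `batch_sampler` argument, e.g. `DataLoader(ds, batch_sampler=StratifiedBatchSampler(labels, 30, len(ds) // 30))`.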
@ilovescience I believe the difference here is that instead of purely oversampling the minority classes, it balances in both directions at random: each batch undersamples the majority classes and oversamples the minority ones. @champs.jaideep correct me if I’m wrong
Not necessarily. My previous, incorrect version of OverSamplingCallback did something similar before I corrected it. I had originally taken the size of the most imbalanced class and set that as the total size, which led to some classes being oversampled and some being undersampled. Those results were poor, and when I fixed the error the results were much better.
I am a little surprised that oversampling did not improve your results in your experiments. I think the key thing is that there are no good augmentations for tabular data like there are for images, so with oversampling on tabular data the model has a higher chance of simply memorizing the repeated data points, whereas with images the augmentations offset that effect.
@ilovescience interesting! How did you go about fixing this error?
And I do notice improved results in a particular research project I am doing, as there are some relatively heavy class imbalances going on (with multiple classes too).
The weighted sampler takes in the total number of samples, which I incorrectly set to the length of the original dataset.
If you look at the callback code, I now have a variable self.total_len_oversample, which is the correct number of samples to pass into the WeightedRandomSampler in order to do correct oversampling:
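In sketch form (this is an approximation of the approach, not the verbatim callback code; `train_labels` is an assumed array of integer class labels):

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

labels = np.asarray(train_labels)      # assumed: one integer class label per training sample
class_counts = np.bincount(labels)     # number of samples in each class
weights = 1.0 / class_counts[labels]   # per-sample weight = inverse of its class frequency

# the corrected total: every class is effectively grown to the size of the largest class
total_len_oversample = len(class_counts) * int(class_counts.max())

sampler = WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                num_samples=total_len_oversample,
                                replacement=True)
```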
Interesting… I see now. Regardless, we can come to the same conclusion that holding data out (partial downsampling) did not seem to help at all. (If anyone’s seen it help, chime in!)
It’s pretty clear to me that you are trying to use this for this competition. I will say that I unfortunately did not have success with oversampling there. But feel free to try it and let me know if you see anything different.
It does, though it’s not hard-coded for a 5-class problem. I use the same code for my research, where I have 10-72 different classes. (Unless I looked at the wrong function!)
Yes, oversampling did not help much, so I thought of trying it this way.
However, I'm facing some issues in running it: I see that the event is not getting called properly. I printed a statement inside __iter__, but it does not run at the start of training, and hence not at the beginning of a batch either. It actually gets called at the end of the first epoch, so I'm not sure I understand correctly how the event gets called during training.
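For what it's worth, in plain PyTorch a sampler's __iter__ runs once each time the DataLoader is iterated, i.e. right as each epoch's batch loop begins, not at the start of training and not once per batch (fastai wraps the DataLoader, so the exact point where a print shows up there may differ). A tiny standalone check, with all names illustrative:

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class NoisySampler(Sampler):
    def __init__(self, n): self.n = n
    def __iter__(self):
        print("__iter__ called")          # prints once per epoch, as iteration begins
        return iter(torch.randperm(self.n).tolist())
    def __len__(self): return self.n

ds = TensorDataset(torch.arange(8).float())
dl = DataLoader(ds, batch_size=4, sampler=NoisySampler(len(ds)))

for epoch in range(2):
    for batch in dl:                      # "__iter__ called" appears here, before the first batch
        pass
```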