Easy 10+% accuracy boost on unbalanced datasets - a "sane default" proposal

hushitz · August 11, 2022, 11:54am

TL;DR: a simple proposal for a “sane default” for any kind of classification training and inference, which significantly mitigates the problem of class distributions differing between training and inference.

Testing on badly unbalanced datasets shows a 10+% improvement in accuracy.

Detail:

In academia and in AI courses, we usually deal with nicely balanced datasets. Unfortunately, in industry your datasets are usually a compromise, limited by how much relevant data you can collect for some categories of interest. I work in the medical field, where you never have enough patient data. Worse, the data you do get is usually totally unrepresentative of the general population. Training on a dataset that differs from your expected inference-time data is a recipe for excessively optimistic expectations followed by a harsh collision with reality.
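To make that gap concrete, here is a minimal sketch of how the same classifier can look much better on a training-like class mix than on the mix it meets at inference time. All the numbers and the `expected_accuracy` helper are hypothetical, made up purely for illustration. It also assumes that per-class recall stays the same when only the class proportions change. This illustrates the problem only; it is not the method from the notebook.

```python
def expected_accuracy(per_class_recall, class_mix):
    """Overall accuracy implied by per-class recall under a given class mix.

    Assumes only the class proportions change between datasets, so each
    class keeps the same recall (an assumption, not a measured fact).
    """
    total = sum(class_mix)
    return sum(r * m / total for r, m in zip(per_class_recall, class_mix))


# Hypothetical two-class example: class 0 dominates the collected data,
# class 1 is rare there but common at inference time.
recall = [0.95, 0.60]       # made-up per-class recall
train_mix = [0.90, 0.10]    # class proportions in the collected dataset
deploy_mix = [0.50, 0.50]   # class proportions expected at inference time

train_acc = expected_accuracy(recall, train_mix)
deploy_acc = expected_accuracy(recall, deploy_mix)
print(f"accuracy on training-like mix: {train_acc:.3f}")
print(f"accuracy on inference mix:     {deploy_acc:.3f}")
```

With these made-up numbers, the validation score on the collected data overstates what you would see in deployment by well over ten points, even though the model itself has not changed.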

A few weeks ago I suggested a theoretical way to solve this:

@jeremy thought this would be worth experimenting on. So I’ve spent some of my holiday doing exactly that, with very encouraging results.

The notebook on GitHub shows exactly what I did and gives detailed results for 1, 5 and 20 epochs, with a 10+% accuracy improvement on a difficult unbalanced dataset based on Imagenette.

Please take a look, and let me know if this seems like a reasonable “sane default” to include in fast.ai.