TL;DR: a simple proposal for a “sane default” for any kind of classification training and inference, significantly mitigating the problem of differing class distributions between training and inference.
Testing on badly unbalanced datasets shows a 10+% improvement in accuracy.
In academia, and in AI courses, we usually deal with nicely balanced datasets. In industry, unfortunately, your datasets are usually a compromise shaped by how hard it is to collect relevant data for some categories of interest. I work in the medical field, where you never have enough patient data; worse, the data you do get is usually totally unrepresentative of the general population. Training on a dataset whose distribution differs from your expected inference-time data is a recipe for excessively optimistic expectations followed by a harsh impact with reality.
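One standard remedy for this kind of prior shift is to rescale the model's predicted class probabilities by the ratio of deployment-time to training-time class priors and renormalize (a Bayes' rule correction). A minimal NumPy sketch, with an illustrative function name and made-up numbers, not the exact method proposed in this thread:

```python
import numpy as np

def adjust_priors(probs, train_priors, target_priors):
    """Reweight predicted class probabilities from a model trained under
    train_priors so they reflect target_priors (Bayes' rule correction),
    then renormalize each row to sum to 1."""
    adjusted = probs * (np.asarray(target_priors) / np.asarray(train_priors))
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Model trained on a 90/10 class split, deployed where classes are 50/50.
probs = np.array([[0.6, 0.4]])  # raw model output for one example
print(adjust_priors(probs, train_priors=[0.9, 0.1], target_priors=[0.5, 0.5]))
# The prediction flips: class 1 now dominates, because the model's apparent
# preference for class 0 was largely an artifact of the skewed training set.
```

Note that this only corrects the output distribution; it does not change what features the model learned from the unbalanced data.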
A few weeks ago I suggested a theoretical way to solve this:
@jeremy thought this would be worth experimenting on, so I’ve spent some of my holiday doing exactly that, with very encouraging results.
The notebook on GitHub shows exactly what I did, and gives detailed results for 1, 5, and 20 epochs, with a 10+% accuracy improvement on a difficult unbalanced dataset based on ImageNette.
Please take a look, and let me know if this seems like a reasonable “sane default” to include in fast.ai.