Hi everyone,
I found the Planet exercise in lesson 2 really instructive, so I decided to focus on this dataset while tackling a different (and, I believe, relevant) problem:
I think the proposed F2 score disregards how good the models are at classifying the minority labels, even though these are the most important ones for understanding impacts on Amazonia (mining, logging, urbanization, roads, etc.).
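To illustrate the point, here is a minimal sketch (with hypothetical toy counts, not the real dataset) of how the sample-averaged F2 used in the competition can stay high while rare labels are missed entirely. It uses sklearn's `fbeta_score`:

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Toy multi-label data with 3 labels: a dominant one (think "primary"),
# a moderately rare one, and a very rare one (think "conventional_mine").
y_true = np.array([[1, 0, 0]] * 90 + [[1, 1, 0]] * 8 + [[1, 0, 1]] * 2)
# A degenerate model that only ever predicts the dominant label:
y_pred = np.array([[1, 0, 0]] * 100)

# Averaged over samples (as in the competition metric), the score is high...
print(fbeta_score(y_true, y_pred, beta=2, average='samples'))
# ...but the per-label scores expose that the rare classes are never found:
print(fbeta_score(y_true, y_pred, beta=2, average=None))
```

The per-label (`average=None`) view, or a macro average, is what makes the weakness on minority labels visible.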
The imbalance between classes is huge: the “clear” and “primary” labels appear more than 25,000 times, whereas labels such as “conventional_mine” appear only a hundred times or less. I don’t think this imbalance can be fixed by oversampling the minority classes, since every duplicated image also duplicates the dominant labels it carries.
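For reference, a minimal sketch of how these label frequencies can be counted, assuming the competition's `train_v2.csv` layout with a space-separated `tags` column (a few stand-in rows are used here instead of reading the file):

```python
from collections import Counter
import pandas as pd

# Stand-in rows; in practice: df = pd.read_csv('train_v2.csv')
df = pd.DataFrame({'tags': ['clear primary',
                            'clear primary agriculture',
                            'haze primary',
                            'clear primary conventional_mine']})

# Flatten the space-separated tag strings and count each label once per image
counts = Counter(tag for tags in df['tags'] for tag in tags.split())
for tag, n in counts.most_common():
    print(f'{tag:20s} {n}')
```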
So as a starting point, I focused only on images with “clear” weather and, moreover, treated the forest as background, i.e. I only selected images where “primary” coexists with the rest of the labels (so that “primary” no longer needs to be predicted).
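A sketch of that filtering step in pandas (again with hypothetical stand-in rows; the label names besides "clear"/"primary"/"conventional_mine" are just examples):

```python
import pandas as pd

df = pd.DataFrame({'image_name': ['a', 'b', 'c', 'd'],
                   'tags': ['clear primary',
                            'clear primary road',
                            'haze primary water',
                            'clear primary conventional_mine']})

sets = df['tags'].str.split().apply(set)
# Keep rows tagged both "clear" and "primary" that carry at least one more label
mask = sets.apply(lambda s: {'clear', 'primary'} <= s and len(s) > 2)
sub = df[mask].copy()
# Drop the two background labels, keeping only the labels of interest
sub['tags'] = sets[mask].apply(
    lambda s: ' '.join(sorted(s - {'clear', 'primary'})))
print(sub)
```

Here only images `b` and `d` survive, with tags `road` and `conventional_mine` respectively.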
I ended up with a very small set of 2864 images. Still imbalanced, but at least following a linear distribution of classes (as opposed to an exponential one; see the 1st figure in this kernel: https://goo.gl/241C2b).
For this small dataset, I followed the same steps that led to F2 scores above 0.9 in the original problem:
1) Start with small image sizes: sz=64.
2) Find the learning rate and fit (2 epochs, cycle_len=1, cycle_mult=2), reaching F2 = 0.7 (validation).
3) Enable data augmentation and fit (3 epochs, cycle_len=1, cycle_mult=2): F2 = 0.725.
4) Unfreeze and train all layer groups with differential learning rates ([lr/9, lr/3, lr], 3 epochs, cycle_len=1, cycle_mult=2): F2 = 0.728.
5) Apply steps 2)-4) again with larger images, sz=128. I achieved only F2 = 0.765. The score was still improving, but I did not push for more epochs at this stage, as the model had already started to overfit.
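For anyone reproducing the steps above, here is a sketch of what the schedule amounts to, assuming the SGDR semantics from the 2018 fastai course (where `cycle_mult=2` doubles the length of each restart cycle, and the epoch count passed to `fit` is the number of cycles). The base `lr = 0.01` is just an example value:

```python
def total_epochs(n_cycles, cycle_len=1, cycle_mult=2):
    """Total epochs run by SGDR: cycle_len, then cycle_len*cycle_mult, etc."""
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycles))

print(total_epochs(2))  # step 2: cycles of length 1 + 2
print(total_epochs(3))  # steps 3-4: cycles of length 1 + 2 + 4

# Differential learning rates used when unfreezing (step 4): earlier layer
# groups get smaller rates than the head.
lr = 0.01
lrs = [lr / 9, lr / 3, lr]
print(lrs)
```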
What would be the best way to tackle this kind of imbalance in multi-label classification? And how could I get better results on this particular dataset?