Highlighting the Big Transfer (BiT) paper from Google that @jeremy pointed out on Twitter. It achieves SOTA on a wide variety of downstream computer vision tasks with fairly standard fine-tuning.
BiT-Large is pre-trained on the JFT-300M dataset with a fairly simple pre-training process and architecture. Interestingly (for me anyway), they ditched BatchNorm in favour of a combination of GroupNorm + Weight Standardisation in order to train with sufficiently large batches.
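For anyone curious what that swap looks like in practice, here's a minimal PyTorch sketch, not their released code (which isn't out yet); `StdConv2d` and `conv_gn_block` are just names I've made up, and the group count / eps are my own defaults:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StdConv2d(nn.Conv2d):
    """Conv2d with Weight Standardisation: each output filter's weights are
    normalised to zero mean / unit variance before the convolution."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=[1, 2, 3], keepdim=True)
        var = w.var(dim=[1, 2, 3], keepdim=True, unbiased=False)
        w = (w - mean) / torch.sqrt(var + 1e-5)
        return F.conv2d(x, w, self.bias, self.stride, self.padding,
                        self.dilation, self.groups)

def conv_gn_block(c_in, c_out, ks=3, stride=1, groups=32):
    """GroupNorm + WS conv block, used where you'd normally put Conv + BatchNorm."""
    return nn.Sequential(
        StdConv2d(c_in, c_out, ks, stride=stride, padding=ks // 2, bias=False),
        nn.GroupNorm(groups, c_out),
        nn.ReLU(inplace=True),
    )
```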
Fine-tuning uses SGD, Mixup, fixed resolution scaling rules, random square crops and image flips (where appropriate).
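A rough sketch of that fine-tuning recipe in plain PyTorch/torchvision; the resolutions, learning rate and Mixup alpha below are placeholders I picked for illustration, not necessarily what the paper prescribes:

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# resize then take a random square crop + horizontal flip
# (sizes are hypothetical; the paper scales them with dataset/image size)
train_tfms = transforms.Compose([
    transforms.Resize(160),
    transforms.RandomCrop(128),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def mixup(x, y, alpha=0.1):
    """Blend random pairs of images; the loss is blended with the same lambda."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0), device=x.device)
    return lam * x + (1 - lam) * x[idx], y, y[idx], lam

def mixup_loss(logits, y_a, y_b, lam):
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)

# plain SGD with momentum for fine-tuning the pre-trained backbone
# optimizer = torch.optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)
```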
They will be releasing code + weights + examples soon.
Aside from ditching BatchNorm, nothing seems too revolutionary (or maybe I'm wrong). Glad to see these fastai basics are still killing it.
They mention that additional hyperparameter tuning could lead to even better results. With @LessW2020's optimizer work and @Diganta's Mish activation function, I'm guessing the community here could probably push these results further.