Their contributions to SOTA results include
- Adaptive Gradient Clipping (AGC), where a gradient is rescaled whenever the ratio ||gradient||/||weights|| exceeds some threshold lambda, and
- A Normalizer-Free Network (NFNet) architecture, found via architecture search.
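To make the AGC rule above concrete, here is a minimal NumPy sketch of the clipping condition (the paper actually applies it unit-wise, e.g. per output row of a weight matrix, and operates on PyTorch parameters; the function name and the simplification to whole-tensor norms are my own):

```python
import numpy as np

def adaptive_grad_clip(grad, weight, lam=0.01, eps=1e-3):
    """Rescale grad when ||G|| / ||W|| exceeds lam.

    Sketch of AGC over a whole tensor; the paper uses per-unit norms
    and clamps the weight norm below by eps to handle zero weights.
    """
    w_norm = max(np.linalg.norm(weight), eps)
    g_norm = np.linalg.norm(grad)
    if g_norm / w_norm > lam:
        # Scale so that the new gradient norm equals lam * ||W||.
        grad = grad * (lam * w_norm / g_norm)
    return grad
```

After clipping, the gradient norm is at most lam times the weight norm, which is what keeps updates proportional to the scale of the weights they modify.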
Their model does not seem to improve inference speed over EfficientNet (at equivalent accuracy), but it improves training speed by a large margin (5-8x) by removing batch norm. I believe this is significant for fastai, as it offers a great opportunity for faster training with transfer learning.
A summary and critique by Yannic Kilcher is available here. As Yannic points out, it is not clear whether the improvements behind the SOTA results come from AGC or from architecture search. To me, it looks more like they implemented a bigger model and got higher accuracy, so I'm not so much interested in the SOTA results themselves as in the removal of BN and its effects on transfer learning on medical datasets.
Furthermore, it would be nice to apply AGC to the individual samples of the minibatch before the mean() or sum() of the loss, if at all possible, rather than clipping after the mean() or sum() as the paper suggests. Yannic suggests this at 23:30.
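The per-sample idea can be sketched as follows: clip each sample's gradient individually, then average the clipped gradients, instead of clipping the single gradient of the averaged loss. This NumPy toy (function name and whole-tensor norms are my own simplifications; in practice you would need per-sample gradients, e.g. via `torch.func.vmap`/`grad` or microbatching) just illustrates the order of operations:

```python
import numpy as np

def per_sample_clipped_mean(per_sample_grads, weight, lam=0.01, eps=1e-3):
    """Clip each sample's gradient by the AGC rule, then average.

    per_sample_grads: list of gradient arrays, one per minibatch sample.
    Contrast with clipping the mean gradient once, where one outlier
    sample can still dominate the pre-clipping average.
    """
    w_norm = max(np.linalg.norm(weight), eps)
    clipped = []
    for g in per_sample_grads:
        g_norm = np.linalg.norm(g)
        if g_norm / w_norm > lam:
            g = g * (lam * w_norm / g_norm)
        clipped.append(g)
    return np.mean(clipped, axis=0)
```

The trade-off is cost: per-sample gradients are roughly a factor of batch-size more expensive to materialize unless vectorized, which is presumably why the paper clips after the reduction.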
I’m new to the fastai forums, and I’m happy to delete this topic or move it to a subtopic if requested. I have used fastai since v1. One of my works citing fastai is Explaining the Rationale of Deep Learning Glaucoma Decisions with Adversarial Examples - PubMed (nih.gov).
Thanks in advance!