Hi all,
I recently attended a deep learning conference and was particularly intrigued by a presentation from Daniel Soudry.
I think his presentation is worth sharing on this forum because, in addition to the nice theoretical work (which I hope to understand someday :)), he reaches several interesting conclusions. In particular, from what I understood:
- Overfitting a bit is somehow healthy: validation loss starts to increase, but validation accuracy can keep improving anyway.
- Weight decay regularization can be reproduced by just rescaling the learning rate. So it could be that weight decay is redundant with a good learning rate schedule (has anyone tested the impact of weight decay when using the 1-cycle policy?). The first sketch after this list shows the scale-invariance this relies on.
- Most interestingly, he shows experimentally that L2 batch norm is almost equivalent to an L1 batch norm rescaled by sqrt(pi/2). This last result could have a high impact, because the L1 norm requires simpler computations (no squares or square roots) and is more stable when working with half-precision floating point. The second sketch after this list illustrates the idea.
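To make the weight decay point concrete, here is a toy check in PyTorch. It is my own sketch, not code from the talk or the paper: it only shows that with batch norm the output is invariant to the scale of the weights, while the gradient shrinks with that scale. From there, the effective step taken on the weight direction behaves like lr / ||w||^2, which is the mechanism by which shrinking the weights (weight decay) and raising the learning rate can play the same role.

```python
import torch
import torch.nn as nn

# Toy check (my own sketch): with batch norm after a linear layer,
# multiplying the weights by a constant c leaves the output unchanged,
# while the gradient shrinks by roughly 1/c.
torch.manual_seed(0)
x = torch.randn(256, 10)
w0 = torch.randn(5, 10)                      # reference weights

def output_and_grad(scale):
    lin = nn.Linear(10, 5, bias=False)
    with torch.no_grad():
        lin.weight.copy_(scale * w0)         # same direction, different norm
    bn = nn.BatchNorm1d(5, affine=False)     # train mode by default
    y = bn(lin(x))
    y.pow(2).mean().backward()               # arbitrary loss, just to get a gradient
    return y.detach(), lin.weight.grad

y1, g1 = output_and_grad(1.0)
y2, g2 = output_and_grad(10.0)
print(torch.allclose(y1, y2, atol=1e-4))     # True: output ignores the weight scale
print((g1.norm() / g2.norm()).item())        # ~10: gradient scales as 1/c
```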
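And here is a rough sketch of the L1 batch norm idea, assuming roughly Gaussian activations (the function name and code below are mine, not an API from the paper or fastai): for a Gaussian variable, E|x - mu| = sigma * sqrt(2/pi), so sqrt(pi/2) times the mean absolute deviation estimates the standard deviation using only absolute values.

```python
import math
import torch

# Rough sketch (my own toy code): estimate the per-channel scale with
# sqrt(pi/2) * mean(|x - mu|) instead of the usual standard deviation,
# so the normalization uses no squares or square roots over the batch.
def l1_batch_norm(x, eps=1e-5):
    """Normalize each channel of a (batch, features) tensor with an L1 statistic."""
    mu = x.mean(dim=0, keepdim=True)
    mad = (x - mu).abs().mean(dim=0, keepdim=True)   # mean absolute deviation
    return (x - mu) / (math.sqrt(math.pi / 2) * mad + eps)

torch.manual_seed(0)
x = torch.randn(4096, 8) * 3 + 1                     # roughly Gaussian activations

sigma_l2 = x.std(dim=0, unbiased=False)              # what standard (L2) batch norm divides by
sigma_l1 = math.sqrt(math.pi / 2) * (x - x.mean(0)).abs().mean(0)
print(sigma_l1 / sigma_l2)                           # ~1.0 per channel
print(l1_batch_norm(x).std(0))                       # ~1.0: output is roughly unit variance
```

The half-precision angle mentioned above is that sums of |x| avoid the large squared terms that an L2 variance computation accumulates, which is presumably why it behaves better in fp16.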
I remember @jeremy saying in one of his videos that very few researchers are interested in training neural nets more efficiently. In that regard, I think Soudry is someone to follow, since he writes nice papers on this topic. In particular, he co-wrote a paper on “Binarized Neural Networks” with Yoshua Bengio, which seems to achieve an impressive accuracy/efficiency trade-off.
What do you think of this presentation and the papers mentioned? Do you think it would be useful to integrate these techniques into the fast.ai library?