L1 batch norm and other interesting works by D. Soudry

Hi all,

I’ve recently attended a deep learning conference and was particularly intrigued by a presentation from Daniel Soudry.

I think his presentation is particularly worth sharing on this forum because, in addition to the nice theoretical work (which I hope I’ll understand someday :)), Daniel Soudry reaches several interesting conclusions. In particular, from what I understood:

  1. Overfitting a bit is somehow healthy: even after the validation loss starts to increase, validation accuracy can still improve.
  2. Weight decay regularization can be reproduced by simply rescaling the learning rate. It could therefore be that weight decay is redundant with a good learning rate schedule (has anyone tested the impact of weight decay when using the 1-cycle technique?)
  3. Most interestingly, he shows experimentally that L2 batch norm is almost equivalent to an L1 batch norm rescaled by sqrt(pi/2). This last result could have a high impact, because the L1 norm requires simpler computations (no squares or square roots) and is more stable when working with half-precision floating point.
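Point 2 rests on the observation that a layer followed by batch norm is invariant to the scale of its weights, so shrinking the weights (as weight decay does) only changes the *effective* learning rate. Here is a minimal NumPy sketch of that scale invariance — my own illustration under simplified assumptions (a plain linear layer, batch norm without learned affine parameters), not code from the presentation:

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(z, eps=1e-5):
    # Normalize each feature over the batch (no learned affine parameters).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

x = rng.standard_normal((256, 10))   # a batch of inputs
W = rng.standard_normal((10, 4))     # weights of a linear layer

out1 = batch_norm(x @ W)
out2 = batch_norm(x @ (5.0 * W))     # same weights, rescaled by 5x

# The normalized output is unchanged by rescaling W, so weight decay
# cannot affect what the layer computes -- only the effective step size.
assert np.allclose(out1, out2, atol=1e-4)
```

Because the output is identical for `W` and `5 * W`, the only thing weight decay changes is where on that ray the weights sit, which in turn rescales the gradients relative to the weights, exactly as a learning rate change would.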
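The sqrt(pi/2) factor in point 3 comes from a standard Gaussian identity: if activations are roughly Gaussian with standard deviation sigma, then E|x - mu| = sigma * sqrt(2/pi), so the mean absolute deviation rescaled by sqrt(pi/2) recovers sigma. A small NumPy sketch of the two normalization denominators (my own illustration, assuming Gaussian activations):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)  # activations of one channel over a batch

# L2 batch norm denominator: the standard deviation.
l2_scale = np.sqrt(np.mean((x - x.mean()) ** 2))

# L1 batch norm denominator: mean absolute deviation times sqrt(pi/2).
# No squares or square roots are needed over the data itself.
l1_scale = np.sqrt(np.pi / 2) * np.mean(np.abs(x - x.mean()))

# For Gaussian-like activations the two scales nearly coincide.
assert abs(l2_scale - l1_scale) / l2_scale < 0.02
```

The L1 version only sums absolute values, which avoids the squaring that overflows easily in float16 — which is presumably why it is more stable in half precision.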

I remember @jeremy saying in one of his videos that very few researchers are interested in training neural nets more efficiently. In that regard, I think Soudry is someone to follow, since he writes nice papers on this topic. In particular, he co-wrote a paper on “Binarized Neural Networks” with Yoshua Bengio, which seems to achieve an impressive accuracy/efficiency trade-off.

What do you think of this presentation and the papers mentioned? Do you think it would be useful to integrate these techniques into the fast.ai library?