Hi, to anyone who wants to go deeper I’d suggest a very nice paper: “Fixup Initialization: Residual Learning Without Normalization”. It introduces a new approach called fixed-update initialization (Fixup). The authors address the exploding and vanishing gradient problem at the beginning of training by properly rescaling a standard initialization. They also found that training residual networks with Fixup is as stable as training with normalization, and that, with proper regularization, it enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation. Enjoy the reading!
Thanks for sharing, Fabrizio.
I’m struggling with the definition of the residual branches F and their count L in the paper. Would anyone be kind enough to draw boxes around F1 and F2 on the ResNet50 below (ignoring the presence of batchnorm layers for a moment)?
Does this mean I should initialize the weights of the upper conv2d with zeros?
Fixup initialization (or: How to train a deep residual network without normalization)
- Initialize the classification layer and the last layer of each residual branch to 0.
- Initialize every other layer using a standard method (e.g., He initialization), and scale only the weight layers inside residual branches by L^(−1/(2m−2)), where m is the number of weight layers in a branch.
- Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each convolution, linear, and element-wise activation layer.
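To make the steps above concrete, here is a minimal PyTorch sketch of one Fixup residual block with m = 2 conv layers. The class name and layer arrangement are my own illustration, not code from the paper: the first conv gets He initialization rescaled by L^(−1/(2·2−2)) = 1/√L, the last conv starts at zero (so the block is the identity at initialization), and the scalar biases and multiplier are inserted as the paper prescribes.

```python
import torch
import torch.nn as nn

class FixupBasicBlock(nn.Module):
    """Illustrative Fixup residual block (m = 2 weight layers per branch)."""

    def __init__(self, channels: int, num_blocks: int):
        super().__init__()
        # Scalar biases (init 0) before each conv / activation,
        # plus one scalar multiplier (init 1) on the branch output.
        self.bias1 = nn.Parameter(torch.zeros(1))
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bias2 = nn.Parameter(torch.zeros(1))
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bias3 = nn.Parameter(torch.zeros(1))
        self.scale = nn.Parameter(torch.ones(1))

        # He-init the first conv, then rescale by L^(-1/(2m-2)) = 1/sqrt(L).
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity="relu")
        with torch.no_grad():
            self.conv1.weight.mul_(num_blocks ** (-0.5))
        # Last layer of the residual branch starts at zero.
        nn.init.zeros_(self.conv2.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(x + self.bias1)
        out = torch.relu(out + self.bias2)
        out = self.conv2(out) * self.scale + self.bias3
        return x + out

L = 8  # hypothetical total number of residual blocks in the network
block = FixupBasicBlock(16, L)
x = torch.randn(2, 16, 8, 8)
# Because conv2 is zero-initialized, the block is exactly the identity:
assert torch.allclose(block(x), x)
```

Note how this answers the question above: it is the *last* conv of each residual branch that is zeroed, which makes every block an identity mapping at initialization, so the network starts out well-behaved at any depth.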