In “part 2” it is repeatedly stressed that a good init is necessary to prevent gradients from vanishing or exploding, and LSUV seems a great improvement.
Here’s a test case where LSUV potentially fails: the SEModule used in the wonderful xse_xresnexts consists of conv2d, relu, conv2d, sigmoid, prodlayer. Given a perfectly gaussian input to the SEModule, the easiest way to achieve gaussian output is to set the biases of the second conv2d layer to very large numbers. The sigmoid then saturates at 1.0, so the prodlayer always multiplies by 1.0, which defeats the purpose of the SEModule while still achieving the goal of LSUV, and you would be none the wiser.
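To make the failure mode concrete, here is a minimal SE-style block (my own simplified sketch, not the fastai implementation) where a huge bias on the second conv saturates the sigmoid, so the block degenerates to an identity, yet any unit-variance check on its output would pass:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Simplified SE-style block: squeeze, conv, relu, conv, sigmoid, prod."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.conv1 = nn.Conv2d(ch, ch // reduction, 1)
        self.conv2 = nn.Conv2d(ch // reduction, ch, 1)

    def forward(self, x):
        w = torch.sigmoid(self.conv2(torch.relu(self.conv1(self.squeeze(x)))))
        return x * w  # the "prodlayer"

se = SEBlock(16)
# Pathological "init" that LSUV would happily accept: a huge bias on conv2
# saturates the sigmoid at exactly 1.0 in float32.
nn.init.constant_(se.conv2.bias, 1e3)

x = torch.randn(8, 16, 4, 4)
out = se(x)
print(torch.allclose(out, x))  # True: the block is now a no-op
print(out.std().item())        # ~1.0, so a unit-variance check passes
```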
However, isn’t the real issue here the nonlinearities themselves, rather than the layers with weights and biases that LSUV can attack?
For example, Jeremy goes to great lengths to improve ReLU - which is probably insight applicable to Swish, Mish, and everything else that is more-or-less ReLU, but with slightly different numbers.
Isn’t the “correct” solution here to take a fast.ai-type approach to automatically work out the bias+scale needed after every nonlinearity in a model, simply by passing gaussian random data through the nonlinearity, and then implementing that post-nonlinearity bias+scale directly as part of the nonlinearity itself? For example, nn.Sigmoid requires (by experimentation with gaussian random numbers) a scale of *4.8 followed by a bias of -2.4.
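The empirical procedure is trivial to sketch (the function name here is my own invention): push gaussian noise through the nonlinearity and read off the scale and bias that restore zero mean / unit variance afterwards.

```python
import torch

def post_act_renorm(act, n=1_000_000):
    """Empirically find scale and bias so that scale*act(z)+bias has
    zero mean and unit variance for z ~ N(0, 1)."""
    torch.manual_seed(0)
    y = act(torch.randn(n))
    scale = 1.0 / y.std()
    bias = -y.mean() * scale
    return scale.item(), bias.item()

scale, bias = post_act_renorm(torch.sigmoid)
print(scale, bias)  # roughly 4.8 and -2.4, matching the figures above
```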
This also applies to some other module types. AdaptivePool, for example, takes many inputs and condenses them, which means the variance declines in proportion to the ratio of inputs to outputs. A simple estimator of the bias is 0 (it’s an average), and the scale is sqrt(len(input.flat)/len(output.flat)).
Some modules are more complicated. Take the SEModule: its activation probably has all the issues Jeremy found for ReLU, and it also contains a Sigmoid layer and finally a ProdLayer. However, the intent of the SEModule is to feed the ProdLayer so as to alter which channels of the input are kept active, so by inspection what we need after the Sigmoid is a scale of *2 and zero bias, so that the ProdLayer multiplies by values with mean=1.0.
However, it strikes me that fast.ai could monkey-patch a great many nn.Modules to add a function with a standard name (e.g. “renorm()”) that returns the constant bias and scale needed for that module (given the module’s state, such as input and output channels). The renorm() functions would be called for all modules before the parameters are put onto the GPU, and would generate parameters included in the forward pass automatically under some defined name (e.g. .renormbias and .renormscale). These parameters would be calculated either by a known formula (as for AvgPool), or by the empirical process of running gaussian random data through the module and measuring what comes out, which takes a trivial amount of time.

Some modules, such as SEModule, might specifically edit their constituent components’ .renormbias/.renormscale, because the intent of the module is not to generate gaussian outputs (e.g. after the Sigmoid), so it would be important to call the parent renorm() after all the child renorm()s.
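Here is one possible shape for that protocol, as a sketch only: every name below (renorm, .renormscale, .renormbias, the classes) is hypothetical, not existing fastai API. A child measures its constants empirically; the parent runs last and overrides them to match its intent.

```python
import torch
import torch.nn as nn

def empirical_renorm(module, n=100_000):
    """Measure the scale/bias that renormalise this module's output to
    zero mean, unit variance, given gaussian input."""
    torch.manual_seed(0)
    y = module(torch.randn(n))
    scale = 1.0 / y.std().item()
    return scale, -y.mean().item() * scale

class RenormSigmoid(nn.Sigmoid):
    def renorm(self):
        self.renormscale, self.renormbias = empirical_renorm(self)

class RenormSEModule(nn.Module):
    """Toy SE-style parent: applies a renormalised sigmoid as a multiplier."""
    def __init__(self):
        super().__init__()
        self.sigmoid = RenormSigmoid()

    def forward(self, x, w):
        s = self.sigmoid(w)
        return x * (s * self.sigmoid.renormscale + self.sigmoid.renormbias)

    def renorm(self):
        self.sigmoid.renorm()  # children first...
        # ...then the parent overrides: the ProdLayer multiplier should have
        # mean 1.0, not mean 0, so after the sigmoid (mean 0.5) we want
        # scale 2 and zero bias.
        self.sigmoid.renormscale, self.sigmoid.renormbias = 2.0, 0.0

se = RenormSEModule()
se.renorm()
mult = se(torch.ones(1000), torch.randn(1000))  # multiplier statistics
print(mult.mean().item())  # ~1.0 by construction
```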
If this were done, then initialising all the remaining weights and biases would simply be a case of using roughly uniform random numbers, possibly polished up with LSUV. The advantage over LSUV alone is that we can fix up problems like the SEModule’s, and no doubt those of other, more complex modules. We also correct the underlying issue, the nonlinearities themselves, rather than hoping that the next layer has sufficiently adjustable weights and biases to compensate for them.