All you need is a good init
I found the idea of empirically choosing the initial weights of your network, proposed in ‘All you need is a good init’ by Dmytro Mishkin and Jiri Matas, super appealing.
My impression from reading the initialization portions of the Kaiming He paper (https://arxiv.org/abs/1502.01852), the Glorot and Bengio paper (http://proceedings.mlr.press/v9/glorot10a.html) and the Zhang, Dauphin and Ma paper (https://arxiv.org/abs/1901.09321) is that each paper corrects earlier initialization schemes for changes in network architecture: Glorot and Bengio correct for deeper architectures, He et al. correct for new non-linearities, and Zhang et al. correct for residual connections.
Taking the approach that Mishkin and Matas propose in ‘All you need is a good init’ means you need to worry much less about the specifics of your architecture, which makes the method much more easily applicable across different networks.
I decided to give implementing the initialization approach a go - my code is in a notebook here https://gist.github.com/simongrest/52404966f0c46f750a823a44618bb06c
Layer Sequential Unit-Variance Initialization
The main idea in the ‘All you need is a good init’ paper is an algorithm the authors call ‘Layer Sequential Unit-Variance Initialization’ or LSUV. Instead of deriving a formula for how to scale the weights in terms of the dimensions of particular layers, the algorithm takes an empirical approach: feed a batch of input data through the network layer by layer, and adjust the initial weights of each layer until the variance of that layer’s output is sufficiently close to 1. Here is some pseudo-code for the algorithm:
for each layer L do:
    initialize the weights of L (WL) with some reasonable starting point
    (see the discussion of the Saxe et al. paper below)
    do:
        increment the iteration counter Ti
        do the forward pass with a mini-batch xb
        calculate the variance of the layer's output, Var(L(xb))
        scale the weights WL by sqrt(Var(L(xb))),
        i.e. WL = WL / sqrt(Var(L(xb)))
    while |Var(L(xb)) − 1.0| ≥ some tolerance and Ti < max iterations
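To make the algorithm concrete, here is a minimal sketch of LSUV in PyTorch. It assumes the model is an nn.Sequential (so we can run the forward pass up to a given layer), and I pass the mini-batch in explicitly; the tolerance and iteration cap are arbitrary choices of mine.

import torch

def LSUV(model, x, tol=0.01, max_iters=10):
    # walk through the layers in order, scaling each layer's weights
    # until the variance of its output is sufficiently close to 1
    for i, layer in enumerate(model):
        if not hasattr(layer, 'weight'):
            continue
        for _ in range(max_iters):
            with torch.no_grad():
                # forward pass of the mini-batch up to and including this layer
                out = model[:i + 1](x)
                var = out.var().item()
                if abs(var - 1.0) < tol:
                    break
                # dividing the weights by the output standard deviation
                # pushes the output variance towards 1
                layer.weight /= var ** 0.5
    return model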
Orthonormal initialization
The authors recommend starting from a random orthogonal initialization. I’ve written some functionality to do this initialization using a singular value decomposition. In the notebook linked above I’ve attempted an explanation of what orthogonality means in this context and why it might be desirable.
def reset_parameters(self):
    with torch.no_grad():
        # start from a random Gaussian weight matrix
        self.weight.normal_(0, 1)
        self.bias.zero_()
        # flatten the kernels into a matrix of shape [out_channels, -1]
        W = self.weight.view(self.weight.shape[0], -1)
        # torch.svd returns V (not its transpose); the rows of V.t()
        # are orthonormal, so we use them as the initial weights
        _, _, V = torch.svd(W)
        self.weight.copy_(V.t().reshape(self.weight.shape))
In the notebook I also briefly talk about what a singular value decomposition is - this is what the torch.svd call in the code above is doing.
There’s a really nice blogpost https://hjweide.github.io/orthogonal-initialization-in-convolutional-layers that helped me think about orthogonality in the context of convolutions.
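As a quick sanity check that the initialization behaves as intended, the flattened weight matrix should have orthonormal rows, i.e. W·Wᵀ should be close to the identity. Something along these lines, where OrthInitConv2D is the convolution class from my notebook (it takes the same constructor arguments as nn.Conv2d):

conv = OrthInitConv2D(8, 16, 3, stride=2, padding=1)
W = conv.weight.view(conv.weight.shape[0], -1)  # shape [16, 8*3*3]
print(torch.allclose(W @ W.t(), torch.eye(W.shape[0]), atol=1e-5))  # should print True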
Comparing CNN output variance: LSUV vs nn.Conv2d initialization
I’ve run some experiments on the output variance of some CNNs of different depths.
I use a convenience function to create a model with the convolution class I specify and an arbitrary number of convolutional layers at the end.
def get_model(convtype=torch.nn.Conv2d, extra_depth=1):
    # three fixed convolutional layers followed by an arbitrary
    # number of extra 32-channel layers
    model = torch.nn.Sequential(
        convtype(1, 8, 5, stride=2, padding=2),
        convtype(8, 16, 3, stride=2, padding=1),
        convtype(16, 32, 3, stride=2, padding=1),
        *[convtype(32, 32, 3, stride=2, padding=1)
          for i in range(extra_depth)]
    )
    return model
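The snippets below feed a mini-batch x through each model. x isn’t defined in this post - it’s a batch of input data from the notebook, something like this (the shape here is just an illustrative assumption on my part):

x = torch.randn(100, 1, 28, 28)  # hypothetical batch of 100 single-channel 28x28 images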
In order to see the difference in output variance between the two convolutional initializations, I created and initialized 100 instances each of four model configurations: shallow (extra_depth=1) and deep (extra_depth=30) versions, built with either nn.Conv2d or with my own OrthInitConv2D initialized using LSUV.
shallow_pytorch_stds = [get_model()(x).std().item() for i in range(100)]
deep_pytorch_stds = [get_model(extra_depth=30)(x).std().item()
                     for i in range(100)]
shallow_orthnormal_stds = [LSUV(get_model(convtype=OrthInitConv2D))(x)
                           .std().item() for i in range(100)]
deep_orthnormal_stds = [LSUV(get_model(convtype=OrthInitConv2D, extra_depth=30))(x)
                        .std().item() for i in range(100)]
Below is a plot of the histograms of the resulting output variances:
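A plot like this can be produced with matplotlib along the following lines (a sketch - the labels and binning are my choices, not necessarily those of the original figure):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for stds, label in [(shallow_pytorch_stds, 'shallow nn.Conv2d'),
                    (deep_pytorch_stds, 'deep nn.Conv2d'),
                    (shallow_orthnormal_stds, 'shallow LSUV'),
                    (deep_orthnormal_stds, 'deep LSUV')]:
    ax.hist(stds, bins=20, alpha=0.5, label=label)
ax.set_xlabel('output standard deviation')
ax.set_ylabel('count')
ax.legend()
plt.show()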
The output standard deviations of the shallow PyTorch-initialized nn.Conv2d models are all quite close to zero, centered around 0.095. For the deeper PyTorch network the standard deviations are even closer to zero - centered around 0.034 - as each extra layer shrinks the signal further. The LSUV-initialized models, on the other hand, both have output standard deviations very closely clustered around 1.
With regards to training stability, the LSUV-initialized shallow network seems to perform similarly to the PyTorch-initialized shallow network. I need to do some more experimentation with the deeper networks on more appropriate data - I’ll update this post once I have.
Thanks to @ste for helping me think through some of this.