What Jeremy just said about not taking anything for granted is very important: I saw that sqrt(5) in the init and thought it was weird a couple of months ago, but then dismissed it, thinking smarter people than me had put it there for a reason.
I should have investigated more.
The idea of taking A^k for big k is similar whether A is a matrix or a number. In the case of a number, if |A| < 1 then A^k --> 0 as k grows, and if |A| > 1 then A^k diverges. (That's the intuition behind vanishing and exploding gradients.)
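A minimal numerical sketch of that intuition with random linear layers (the width of 512, the depth of 50, and the scale factors 0.5 / 1.5 are arbitrary choices):

```python
import torch

x = torch.randn(512)
for scale in (0.5, 1.5):
    a = x.clone()
    # apply 50 random linear maps whose weights are a bit too small / too large
    for _ in range(50):
        w = torch.randn(512, 512) * scale / 512 ** 0.5
        a = w @ a
    print(f"scale={scale}: activation std after 50 layers = {a.std().item():.3e}")
# scale=0.5 collapses toward 0 (vanishing); scale=1.5 blows up (exploding)
```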
Yes. But how do you do it in init without computing activations?
it would be cool to connect clever initialisation tricks with loss visualisations - is it on the plan for the next lessons?
Setting default values in any deep learning software is a recipe for a future bug
this makes sense, thanks!
I'm still confused why we're talking about sqrt(5) but the twitter thread that was referenced clearly had sqrt(3). What am I missing?
I mean, less efficient performance isn't a bug though, is it?
I agree default values have been less than ideal, and better defaults were one of the biggest selling points of fast.ai when it came out.
Similar to this PyTorch caveat, OpenCV stores (RGB) channels in reverse too!
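For example (the filename is hypothetical; assumes opencv-python is installed and that you want the usual RGB order downstream):

```python
import cv2

img_bgr = cv2.imread("some_image.jpg")               # OpenCV reads channels as BGR
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)   # reorder to RGB for most other libraries
```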
The reason again (that I found) is "following the original approach".
Deep learning sounds a lot like medicine. Lots of "why do we do it that way?... because that's the way we've always done it..."
That's a good question; my mental model is that init is about making sure the initial state of the network passes activations and gradients at roughly the right magnitudes, and batchnorm is about maintaining that "invariant" after training disturbs the weights, but maybe that's not the best way to think about it.
Here are the two instances in the source code where they use sqrt(5):
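For context, a paraphrased sketch (not the verbatim source) of what those reset_parameters calls look like:

```python
import math
import torch
from torch.nn import init

weight = torch.empty(64, 128)  # stand-in for an nn.Linear / nn.Conv2d weight
# paraphrased from reset_parameters in nn.Linear and nn._ConvNd:
init.kaiming_uniform_(weight, a=math.sqrt(5))
```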
We've focused a lot of our attention on ConvNets... What is the recommended init approach for RNNs?
No, it's a good way, but you can't really initialize your stats without doing the math; that's just what I was saying. If you want to compute them as a function of what your activations are, that happens during training, so it's like BatchNorm.
Even trickier because of the recurrence, so you'll get exploding activations and gradients even more easily. The same kaiming init (usually uniform) works well.
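A minimal sketch of applying kaiming uniform to an RNN's weight matrices by hand (the layer sizes and the nonlinearity argument are placeholder choices, not a recommendation from the thread):

```python
import torch
from torch import nn

rnn = nn.RNN(input_size=100, hidden_size=256, num_layers=2)
for name, p in rnn.named_parameters():
    if "weight" in name:
        # kaiming uniform on both input-to-hidden and hidden-to-hidden weights
        nn.init.kaiming_uniform_(p, nonlinearity="relu")
    else:
        nn.init.zeros_(p)  # biases to zero
```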
Yeah, I get that sqrt(5) is in use, but I also see that sqrt(3) is in use, and it appears to me that someone somewhere is confusing sqrt(5) for sqrt(3), because there's no discussion that I can see about how these two numbers are related and why they are different.
Is there any particular reason you know of why uniform is used in PyTorch over Gaussian?
sqrt(3) comes from using a uniform initialization instead of a normal one: a centered uniform distribution on [-a, a] has std a/sqrt(3), so to hit a given std the bound is scaled by sqrt(3).
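A quick numerical check of that relationship (the target std of 0.1 and the sample count are arbitrary):

```python
import torch

std = 0.1                          # target standard deviation
bound = std * 3 ** 0.5             # uniform on [-bound, bound] has std bound / sqrt(3)
samples = torch.empty(1_000_000).uniform_(-bound, bound)
print(samples.std().item())        # ~0.1, matching the target
```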
No idea; I think it's another case of an ancestral recipe.
Gradient clipping vs. good init: which is better for vanishing/exploding gradients?