Lesson 9 Discussion & Wiki (2019)

What Jeremy just said about not taking anything for granted is very important: I saw that sqrt(5) in the init a couple of months ago and felt it was weird, but then dismissed it, thinking smarter people than me had put it there for a reason.
I should have investigated more :wink:

21 Likes

The idea of taking A^k for big k is similar whether A is a matrix or a number. In the case of a number, if |A| < 1, A^k --> 0 for big k; if |A| > 1, A^k diverges. (That should give you an intuition for vanishing and exploding gradients.)
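A quick numerical sketch of that intuition (the sizes and the 0.9/1.1 scales are just made up): push a vector through the "same layer" a hundred times and watch the norm either collapse or blow up.

```python
import torch

torch.manual_seed(0)
x = torch.randn(512)

for scale in (0.9, 1.1):
    A = torch.eye(512) * scale   # a matrix that shrinks (0.9) or grows (1.1) vectors
    v = x.clone()
    for _ in range(100):         # roughly 100 "layers"
        v = A @ v
    print(scale, v.norm().item())  # tiny (~1e-3) for 0.9, huge (~1e5) for 1.1
```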

8 Likes

Yes. But how do you do it in init without computing activations?

It would be cool to connect clever initialisation tricks with loss visualisations - is that in the plan for the next lessons? :slight_smile:

2 Likes

Setting default values in any deep learning software is a recipe for a future bug :smiley:

1 Like

this makes sense, thanks!

I’m still confused why we’re talking about sqrt(5) when the Twitter thread that was referenced clearly had sqrt(3). What am I missing?

2 Likes

I mean, less efficient performance isn’t a bug though, is it?

I agree default values have been less than ideal, and better defaults were one of the biggest attributes of fast.ai when it came out.

Similar to this PyTorch caveat, OpenCV stores (RGB) channels in reverse order (BGR) too!
The reason again (the one I found) is “following the original approach”
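For anyone who hasn’t run into it yet, a small sketch (the file name is made up):

```python
import cv2

img_bgr = cv2.imread("some_image.jpg")              # OpenCV returns BGR, not RGB
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)  # convert before feeding matplotlib / PyTorch
# or equivalently, just reverse the channel axis:
img_rgb_alt = img_bgr[:, :, ::-1]
```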

4 Likes

Deep learning sounds a lot like medicine. Lots of “why do we do it that way?..because that’s the way we’ve always done it…”

10 Likes

That’s a good question; my mental model is that init is about making sure the initial state of the network passes activations and gradients at roughly the right magnitudes, and batchnorm is about maintaining that “invariant” after training disturbs the weights, but maybe that’s not the best way to think about it.
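To make that concrete, here’s a little experiment I find helpful (layer sizes and the std=0.01 baseline are arbitrary): stack 20 linear+ReLU layers and compare the final activation std for a naive init vs kaiming.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 512)

def final_act_std(init_fn, n_layers=20):
    h = x
    for _ in range(n_layers):
        lin = nn.Linear(512, 512)
        init_fn(lin.weight)
        h = torch.relu(lin(h))
    return h.std().item()

print(final_act_std(lambda w: nn.init.normal_(w, std=0.01)))                     # vanishes to ~0
print(final_act_std(lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu')))  # stays roughly constant
```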

Here are the two instances in the source code where they use sqrt(5):
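For reference, they’re the reset_parameters methods of nn.Linear and the conv layers in torch/nn/modules; paraphrasing from memory (exact code may differ by PyTorch version), they look roughly like this:

```python
import math
from torch.nn import init

def reset_parameters(self):
    # roughly what nn.Linear / _ConvNd do in PyTorch ~1.0
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
```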

We’ve focused a lot of our attention on ConvNets… What is the recommended init approach for RNNs?

1 Like

No, it’s a good way to think about it, but you can’t really initialize your stats without doing the math; that’s just what I was saying. If you want to compute them as a function of what your activations actually are, that has to happen during training, so it’s like BatchNorm.

It’s even trickier because of the recurrence, so you’ll get exploding activations and gradients even more easily. The same kaiming init (usually the uniform version) works well.
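A hypothetical sketch of what that could look like for a plain PyTorch RNN (the zero-bias choice is mine, not anything prescribed):

```python
import torch.nn as nn

rnn = nn.RNN(input_size=128, hidden_size=128, num_layers=2)

# re-initialize the input-to-hidden and hidden-to-hidden weight matrices
for name, p in rnn.named_parameters():
    if 'weight' in name:
        nn.init.kaiming_uniform_(p)   # kaiming uniform, as suggested above
    elif 'bias' in name:
        nn.init.zeros_(p)
```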

3 Likes

Yeah, I get that sqrt(5) is in use, but I also see that sqrt(3) is in use, and it appears to me that someone somewhere is confusing sqrt(5) for sqrt(3) because there’s no discussion that I can see about how these two numbers are related and why they are different.

Is there any particular reason you know of why uniform is used in PyTorch over Gaussian?

5 Likes

sqrt(3) comes from using a uniform initialization instead of a normal one: a centered uniform distribution on [-a, a] has std a / sqrt(3), so to get a given std the bound has to be sqrt(3) times larger.
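Quick numerical check of that factor (purely illustrative):

```python
import torch

a = 1.0
samples = torch.empty(1_000_000).uniform_(-a, a)
print(samples.std().item())   # ~0.577, i.e. a / sqrt(3)

# so to hit a target std, the uniform bound has to be sqrt(3) * std -- which is
# what kaiming_uniform_ does: bound = sqrt(3) * gain / sqrt(fan_in)
```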

1 Like

No idea, I think it’s another case of ancestral recipe :wink:

5 Likes

Gradient clipping vs. good init: which is better for dealing with vanishing/exploding gradients?

1 Like