What Jeremy just said about not taking anything for granted is very important: I saw that sqrt(5) in the init and thought it was weird a couple of months ago, but then dismissed it, thinking smarter people than me had put it there for a reason.
I should have investigated more.
The idea of taking A^k for big k is similar whether A is a matrix or a number. In the case of a number, if |A| < 1 then A^k --> 0 as k grows, and if |A| > 1 then A^k diverges. (That's the intuition behind vanishing and exploding gradients.)
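A minimal numerical sketch of that intuition with random linear layers (the width of 512, the depth of 50, and the scale factors 0.5 / 1.5 are arbitrary choices):

```python
import torch

x = torch.randn(512)
for scale in (0.5, 1.5):
    a = x.clone()
    # apply 50 random linear maps whose weights are a bit too small / too large
    for _ in range(50):
        w = torch.randn(512, 512) * scale / 512 ** 0.5
        a = w @ a
    print(f"scale={scale}: activation std after 50 layers = {a.std().item():.3e}")
# scale=0.5 collapses toward 0 (vanishing); scale=1.5 blows up (exploding)
```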
Yes. But how do you do it in init without computing activations?
it would be cool to connect clever initialisation tricks with loss visualisations - is it on the plan for the next lessons?
Setting default values in any deep learning software is a recipe for a future bug
this makes sense, thanks!
I'm still confused why we're talking about sqrt(5) but the twitter thread that was referenced clearly had sqrt(3). What am I missing?
I mean, less efficient performance isn't a bug though, is it?
I agree default values have been less than ideal, and better defaults were one of the biggest selling points of fast.ai when it came out.
Similar to this PyTorch caveat, OpenCV stores (RGB) channels in reverse too!
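For example (the filename is hypothetical; assumes opencv-python is installed and that you want the usual RGB order downstream):

```python
import cv2

img_bgr = cv2.imread("some_image.jpg")               # OpenCV reads channels as BGR
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)   # reorder to RGB for most other libraries
```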
The reason again (that I found) is "following the original approach".
Deep learning sounds a lot like medicine. Lots of "why do we do it that way?... because that's the way we've always done it..."
That's a good question; my mental model is that init is about making sure the initial state of the network passes activations and gradients at roughly the right magnitudes, and batchnorm is about maintaining that "invariant" after training disturbs the weights, but maybe that's not the best way to think about it.
Here are the two instances in the source code where they use sqrt(5):
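For context, a paraphrased sketch (not the verbatim source) of what those reset_parameters calls look like:

```python
import math
import torch
from torch.nn import init

weight = torch.empty(64, 128)  # stand-in for an nn.Linear / nn.Conv2d weight
# paraphrased from reset_parameters in nn.Linear and nn._ConvNd:
init.kaiming_uniform_(weight, a=math.sqrt(5))
```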
We've focused a lot of our attention on ConvNets... What is the recommended init approach for RNNs?
No, it's a good way, but you can't really initialize your stats without doing the math; that's just what I was saying. If you want to compute them as a function of what your activations are, that happens during training, so it's like BatchNorm.
Even trickier because of the recurrence, so you'll get exploding activations and gradients even more easily. The same kaiming init (usually uniform) works well.
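A minimal sketch of applying kaiming uniform to an RNN's weight matrices by hand (the layer sizes and the nonlinearity argument are placeholder choices, not a recommendation from the thread):

```python
import torch
from torch import nn

rnn = nn.RNN(input_size=100, hidden_size=256, num_layers=2)
for name, p in rnn.named_parameters():
    if "weight" in name:
        # kaiming uniform on both input-to-hidden and hidden-to-hidden weights
        nn.init.kaiming_uniform_(p, nonlinearity="relu")
    else:
        nn.init.zeros_(p)  # biases to zero
```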
Yeah, I get that sqrt(5) is in use, but I also see that sqrt(3) is in use, and it appears to me that someone somewhere is confusing sqrt(5) for sqrt(3), because there's no discussion that I can see about how these two numbers are related and why they are different.
Is there any particular reason you know of why uniform is used in PyTorch over Gaussian?
sqrt(3) comes from using a uniform initialization instead of a normal one: a centered uniform distribution on [-a, a] has std a/sqrt(3), so to hit a given std the bound is scaled by sqrt(3).
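A quick numerical check of that relationship (the target std of 0.1 and the sample count are arbitrary):

```python
import torch

std = 0.1                          # target standard deviation
bound = std * 3 ** 0.5             # uniform on [-bound, bound] has std bound / sqrt(3)
samples = torch.empty(1_000_000).uniform_(-bound, bound)
print(samples.std().item())        # ~0.1, matching the target
```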
No idea; I think it's another case of an ancestral recipe.
Gradient clipping vs. good init: which is better for vanishing/exploding gradients?