Lesson 8 readings: Xavier and Kaiming initialization

Hi everyone!

As part of lesson 8 it was suggested to read the papers that introduced two kinds of initialization: the Xavier initialization and the Kaiming initialization.

fastai staff, and especially Rachel and Jeremy, always push us towards blogging, so I made a blog post explaining them as best as I can.

It’s a bit mathy because there’s not really a way around it, but it’s (way) more detailed than the papers, so (hopefully) it should help you understand them better, or at least a part you’re struggling with. And it covers both papers in one post.

Anyway, here’s the link.

Thanks to @dhoa, @simonjhb and @cqfd for helping me figure out a part.

I hope you like it. Obviously, all feedback would be greatly appreciated!

(if you find my post clear and useful enough I can add it to the resources of lesson 8, but I’m waiting on your feedback to do that. Maybe it’s just confusing ¯\_(ツ)_/¯ )


Thanks, this is fantastic! I’ve been wrestling with this paper for a few hours, and wasn’t really making much progress, so this should be really helpful for me.


Awesome work @PierreO . Next job - can you tell me why my average layer variances in the notebooks are nearly always quite a bit <1, especially in the later layers? Even after I use kaiming init and a shifted relu?

(Skip ahead to the next notebook to see lots of charts of this.)
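For anyone who wants to poke at this outside the course notebooks, here is a rough sketch of that kind of per-layer measurement (the layer sizes, the fake data and the shift value below are placeholders, not what the notebooks use):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def shifted_relu(x, shift=0.5):
    # "shifted ReLU": subtract a constant so the post-ReLU activations
    # have a mean closer to 0 (the actual shift value here is a placeholder)
    return torch.relu(x) - shift

# a toy stack of linear layers with Kaiming (He) initialization
layers = [nn.Linear(512, 512) for _ in range(8)]
for layer in layers:
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
    nn.init.zeros_(layer.bias)

x = torch.randn(10_000, 512)   # fake unit-variance input data
with torch.no_grad():
    for i, layer in enumerate(layers):
        x = shifted_relu(layer(x))
        print(f"layer {i}: mean {x.mean():.3f}, var {x.var():.3f}")
```

With these made-up settings the per-layer variance tends to drift well below 1 as depth increases, which is the effect being asked about.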

That’s the goal!

I noticed that as well… I’ll try to investigate. One thing I wasn’t sure about when I went through the paper is how true some of the assumptions of independence are; maybe it’s linked to that? Or maybe not.

The pixels and channels certainly aren’t independent…


I tried to read both the paper and this link, and the problem for me is that I don’t have a clue about variance/expectation and their formulas. A friend pointed out to me that they’re part of the theory of random variables, so I’m currently having a quick go at the Random variables section on Khan Academy.

Any other recommendations?


More broadly, the notions of expectation and variance are central to probability and statistics, so you might want to check out introductory courses on those subjects.


This could obviously take a while to work through, but I absolutely loved Blitzstein’s Statistics 110 lectures from Harvard: https://www.youtube.com/watch?v=KbB0FjPg0mw&list=PL2SOU6wwxB0uwwH80KTQ6ht66KWxbzTIo

MIT’s intro probability course is also very nice, if a bit drier than the Blitzstein videos: https://www.youtube.com/watch?v=1uW3qMFA9Ho&list=PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6

Thank you for the great blog post! :slight_smile:


Perhaps I totally misunderstand the issue, but in my opinion, dependence between outputs and parameters is baked into the process of optimization. Independence of the outputs and weights would hold for the first epoch, where the weights are randomly initialized. After the first parameter update based on backpropagation of the gradients, the parameters are no longer totally random with respect to the output. With each mini-batch this dependency grows, and this is what we want, because this accumulation of non-randomness is "the learning" that we are after.

Pierre writes in the Backward-propagation section of the excellent blog post :

We also assume that w_l and Δy_l are independent of each other.

Based on my reasoning, this would be true for epoch 1. For subsequent epochs, this assumption might not hold.

Based on my reasoning, this would be true for epoch 1. For subsequent epochs, this assumption might not hold.

Remember that what we are trying to do here is to initialize the network, so we only care about epoch 1.

But even for epoch 1, I don’t think the activations are really independent. After all, even if we assume that the input is independent, the very nature of convolution is to compute over small, overlapping parts of the image (or of the activations) with shared weights, and with a lot of redundancy. So I really don’t think independence holds.
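Here is a quick empirical check of that (the sizes are arbitrary, and this isn’t from either paper): apply a freshly initialized conv to i.i.d. Gaussian inputs and look at the correlation between horizontally adjacent activations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A randomly initialized conv applied to i.i.d. Gaussian "images":
# adjacent outputs share most of their receptive field and reuse the
# same kernel, so they are not independent even before any training.
conv = nn.Conv2d(1, 32, kernel_size=3, padding=1, bias=False)
nn.init.kaiming_normal_(conv.weight, nonlinearity='relu')

x = torch.randn(1024, 1, 16, 16)
with torch.no_grad():
    y = conv(x)

# sample correlation (over the batch) between horizontally adjacent activations
y = y - y.mean(dim=0, keepdim=True)
left, right = y[..., :-1], y[..., 1:]
corr = (left * right).mean(dim=0) / (left.std(dim=0) * right.std(dim=0) + 1e-8)
print(f"mean |corr| between neighbouring activations: {corr.abs().mean():.3f}")
```

For genuinely independent variables the sample correlations would hover around zero (up to estimation noise); here they come out noticeably larger in magnitude because neighbouring outputs overlap and share weights.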


@PierreO I have one question after going through your blog post one more time:
How is the zero-mean Gaussian distribution with a standard deviation of sqrt(2/n_l) and a bias of 0 derived from equation 5 of your blog post? (Is there some kind of polar coordinate transformation involved?)

The math is quite advanced. I wonder if Kaiming He et al. had some intuition about what could lead to more stable training in the first place?
Something like: if my network is converging too slowly, or not at all, I can try to decrease
the standard deviation of the zero-mean Gaussian distribution to improve training (this would be a nice experiment to try out with fastai/PyTorch).

I wonder if there is a way to get an optimal initialization in an automated fashion, i.e., automatically finding the Var[w_l] that keeps the variance stable for a generic activation function. This could be interesting for other activation functions like ELU/SELU/GELU, which are used in GANs where training stability can be an issue.
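Not from either paper, but here is one naive way that kind of automation could look, under the same zero-mean, independent-weights assumptions (the helper and the Monte Carlo estimate are just my own sketch): estimate the "gain" for an arbitrary activation empirically instead of deriving it by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def empirical_gain(act, n_samples=1_000_000):
    # Feed a unit Gaussian (the idealized pre-activation) through the
    # nonlinearity and measure its second moment: with zero-mean weights,
    # this is what drives the next layer's pre-activation variance.
    z = torch.randn(n_samples)
    second_moment = act(z).pow(2).mean()
    return (1.0 / second_moment).sqrt()

for name, act in [('relu', nn.ReLU()), ('elu', nn.ELU()),
                  ('selu', nn.SELU()), ('gelu', nn.GELU())]:
    gain = empirical_gain(act)
    print(f"{name:5s}: init weights with std = {gain:.3f} / sqrt(fan_in)")
```

For ReLU this lands at roughly sqrt(2) ≈ 1.41, i.e. exactly the Var[w_l] = 2/n_l from the paper; torch.nn.init.calculate_gain gives precomputed gains for some of these activations.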

I think all that’s happening there is that equation (5) is a requirement we need to satisfy, but we can satisfy it however we wish. Drawing each weight independently from a normal distribution with variance 2/n_l is basically the simplest thing we could do, but if you really wanted to you could use some other distribution as long as it’ll give you the right variance.
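A quick way to see that in practice (sizes arbitrary, and the ±1 "sign" distribution is only there to make the point): give three different weight distributions the same variance 2/n_l and check that the activations end up at a similar scale after several ReLU layers.

```python
import math
import torch

torch.manual_seed(0)

n, depth = 512, 10
x0 = torch.randn(10_000, n)
target_std = math.sqrt(2.0 / n)   # i.e. Var[w] = 2 / n_l

inits = {
    'normal':  lambda w: w.normal_(0, target_std),
    'uniform': lambda w: w.uniform_(-math.sqrt(3) * target_std,
                                    math.sqrt(3) * target_std),
    'sign':    lambda w: w.bernoulli_(0.5).mul_(2).sub_(1).mul_(target_std),
}

for name, init_fn in inits.items():
    x = x0
    for _ in range(depth):
        w = torch.empty(n, n)
        init_fn(w)                      # zero mean, Var[w] = 2 / n
        x = torch.relu(x @ w.t())
    print(f"{name:7s}: final activation std ≈ {x.std():.3f}")
```

All three stay in the same ballpark instead of exploding or vanishing; swap the variance for something else (say 1/n) and the scale drops quickly with depth. What matters is the variance of the weights, not the particular distribution.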


Thank you very much, I was overcomplicating it. :slight_smile:

Also, PyTorch has different Kaiming He initialization functions with different distributions:

  • torch.nn.init.kaiming_normal_
  • torch.nn.init.kaiming_uniform_
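Both target the same Var[w_l] = 2/n_l when called with mode='fan_in' and nonlinearity='relu', just with different distributions; a quick check on an arbitrary layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(512, 512)
fan_in = layer.in_features

nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
print(f"kaiming_normal_ : Var[w] ≈ {layer.weight.var().item():.5f}")

nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')
print(f"kaiming_uniform_: Var[w] ≈ {layer.weight.var().item():.5f}")

print(f"target 2 / n_l   = {2 / fan_in:.5f}")
```

Note that nn.Linear's own default initialization calls kaiming_uniform_ with a=sqrt(5), which targets a different variance, so it's worth passing the arguments explicitly.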


To be more accurate, it’s not epoch 1, it’s more like the first batch.

Your blog is very good, though I saw it after I had suffered through working out the math on my own :slight_smile: One question I am wrestling with, though, is why we even look at backprop (I don’t mean the math but the logic of it). The weights get updated through backprop, so there is no initialization at that point. Also, by the same token, I don’t see how the assumption of independence of y and W holds in backprop (we use W to get to y). I would appreciate any help in understanding this part.

One question I am wrestling with, though, is why we even look at backprop (I don’t mean the math but the logic of it). The weights get updated through backprop, so there is no initialization at that point.

There are two issues that we’re dealing with (without proper initialization):

  • The vanishing or exploding of the activations (the outputs of each layer) in the forward pass;
  • The vanishing or exploding of the gradients Δx in the backward pass.

Why the gradients can also explode is explained by equation (9) in my post: just as in the forward pass, the variances of the layers get multiplied together (and that derives mainly from equation (7), which defines back-prop).
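For reference, the two products in question (in the He et al. paper’s notation) are roughly:

Var[y_L] = Var[y_1] · ∏_{l=2}^{L} (1/2) n_l Var[w_l]          (forward)
Var[Δx_2] = Var[Δx_{L+1}] · ∏_{l=2}^{L} (1/2) n̂_l Var[w_l]     (backward)

Unless each factor is kept close to 1, which is exactly what Var[w_l] = 2/n_l (respectively 2/n̂_l) achieves, either product shrinks or grows exponentially with the depth L.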

Also, by the same token, I don’t see how the assumption of independence of y and W holds in backprop (we use W to get to y). I would appreciate any help in understanding this part.

Unless I’m mistaken, there’s no assumption of independence between y and W. Could you point me to where it’s stated?


I’m lost early, but I looked and am not sure which is the next notebook. Does anyone have a hint?

Thank you

Thomas

I think for backprop what is mentioned is that Δy_l and w_l are independent of each other, which will hold true at initialization.

Sorry I should have been specific - and it’s not quite the next notebook anyway! I meant this one.
