Lesson 8 readings: Xavier and Kaiming initialization

I am having a little bit of trouble following your post. In particular, this step is not very clear:

\begin{align*} \mathbb{E}[x_l^2] &= \mathbb{E}\left[max(0, y_{l-1})^2 \right] \\ &= \frac{1}{2} \mathbb{E}\left[y_{l-1}^2\right] \\ &= \frac{1}{2} Var[y_{l-1}] \end{align*}

How do you expand the expectation of the max function? I think it works the way I wrote below, but I am not sure. Since the max function is piecewise defined, I expanded its expectation as a probability-weighted sum of the expectations of the pieces, but I don’t know whether that is correct or whether such a rule of expansion even exists. (I tried to Google the expectation of the max function, but the answers use scary integrals and don’t explain it clearly.) I would be grateful for clarification:

\begin{align*} \mathbb{E}[x_l^2] &= \mathbb{E}\left[max(0, y_{l-1})^2 \right] \\ &= \mathbb{E} [0] \mathbb{P} (y_{l-1} \leq 0) + \mathbb{E} [y_{l-1}^2]\mathbb{P} (y_{l-1} > 0) \\ &= \frac{1}{2} \mathbb{E}\left[y_{l-1}^2\right] \\ &= \frac{1}{2} Var[y_{l-1}] \end{align*}

Since y_{l-1} is distributed symmetrically around 0, it’s positive half the time and negative half the time (symmetrically!), so if you just keep the positive part max(0, -), you’re keeping half of the expectation. More explicitly, you can break the expectation up into when y_{l-1} is negative and when it’s positive.

\begin{align} \mathbb{E}[max(0, y_{l-1})^2] &= \mathbb{P}(y_{l-1} < 0) \cdot\mathbb{E}[max(0, y_{l-1})^2 | y_{l-1} < 0] + \mathbb{P}(y_{l-1} > 0) \cdot \mathbb{E}[max(0, y_{l-1})^2 | y_{l-1} > 0] \\ &= \frac{1}{2}\mathbb{E}[max(0, y_{l-1})^2 | y_{l-1} < 0] + \frac{1}{2}\mathbb{E}[max(0, y_{l-1})^2 | y_{l-1} > 0] \end{align}

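The first term is zero, because max(0, y_{l-1}) = 0 whenever y_{l-1} < 0. In the second term max(0, y_{l-1}) = y_{l-1}, and by symmetry \mathbb{E}[y_{l-1}^2 | y_{l-1} > 0] = \mathbb{E}[y_{l-1}^2], so the second term is \frac{1}{2}\mathbb{E}[y_{l-1}^2].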

Also, \mathbb{E}[y_{l-1}] = 0 since y_{l-1} is symmetric around 0, so Var[y_{l-1}] = \mathbb{E}[y_{l-1}^2].

Edit: Sorry, I couldn’t see the bottom of your post for some reason (gave me some sort of rendering error). You can calculate the expectation by e.g.

\begin{align} \mathbb{E}[max(0, y_{l-1})^2] &= \int_{-\infty}^{\infty} max(0, y_{l-1})^2 f(y_{l-1}) dy_{l-1} \\ &= \int_{-\infty}^{0} ... + \int_{0}^{\infty} ... \end{align}

(I’ve denoted the density of y_{l-1} by f.)

Anyway, the two integrals are the two pieces you’re looking for (the “rule of expansion” you refer to). It’s just splitting up an integral into two pieces.
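
If it helps to see the result numerically, here is a minimal sketch that checks \mathbb{E}[max(0, y_{l-1})^2] \approx \frac{1}{2} Var[y_{l-1}] by sampling. It assumes y_{l-1} is a zero-mean Gaussian (any distribution symmetric around 0 should behave the same way), and the scale `sigma` and sample count `n` are arbitrary values picked for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n = 200_000
sigma = 1.7  # arbitrary scale, chosen only for illustration

# ys plays the role of y_{l-1}: zero mean, symmetric around 0 (assumed Gaussian here)
ys = [random.gauss(0.0, sigma) for _ in range(n)]

relu_second_moment = sum(max(0.0, y) ** 2 for y in ys) / n  # estimate of E[max(0, y)^2]
second_moment = sum(y * y for y in ys) / n                  # estimate of E[y^2]
mean = sum(ys) / n
variance = second_moment - mean ** 2                        # estimate of Var[y]

# The three printed numbers should agree up to sampling noise
print(relu_second_moment, 0.5 * second_moment, 0.5 * variance)
```

The agreement is only approximate because these are sample estimates; increasing `n` should tighten it.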


I was looking into the PyTorch implementation of these initializations and found that, in order to calculate the bounds of the uniform distribution, they multiply the standard deviation by the square root of 3.

fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
std = gain * math.sqrt(2.0 / (fan_in + fan_out))
a = math.sqrt(3.0) * std  # Calculate uniform bounds from standard deviation
with torch.no_grad():
    return tensor.uniform_(-a, a)

I wonder where that sqrt(3) came from. I can’t find anything about this relationship between the normal and uniform distributions. I would appreciate it if someone could explain that or point me in the right direction.

This is explained in the next lesson’s video.

This also came up in the study group at USF, so maybe you shouldn’t be kept waiting…

The \sqrt{3} comes from the standard deviation of a uniform distribution – if you select uniformly from [-a,\,a], the standard deviation is a/\sqrt{3}. (You can look it up on Wikipedia, but why is that the answer? – a question Jeremy posed to the study group)

kaiming_uniform (like the xavier_uniform_ code in your snippet above) solves the inverse problem: if a uniform distribution on [-a, \,a] has to have standard deviation std, what is a? (That’s why the \sqrt{3} is in the numerator, rather than the denominator.)

This doesn’t have anything to do with normal distributions, by the way.
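
A quick way to convince yourself is to sample. The sketch below uses only the Python standard library rather than PyTorch, and `std_target` is a made-up value for illustration; the bound is computed the same way as in the snippet above.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

std_target = 0.05                 # hypothetical target standard deviation
a = math.sqrt(3.0) * std_target   # bound computed as in the snippet above

# Draw from Uniform(-a, a) and measure the standard deviation of the samples
samples = [random.uniform(-a, a) for _ in range(200_000)]
std_measured = statistics.pstdev(samples)

# The two printed numbers should match up to sampling noise
print(std_target, std_measured)
```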


OK, now I see where that comes from. If you work out the standard deviation of a uniform distribution on the interval [-a, a], you get \frac{a}{\sqrt{3}}. Since the variance of a uniform distribution on [p, q] is \sigma^2 = \frac{(q-p)^2}{12}:
\sigma = \sqrt{\frac{(a - (-a))^2}{12}} = \sqrt{\frac{(2a)^2}{12}} = \sqrt{\frac{4a^2}{12}} = \sqrt{\frac{a^2}{3}} = \frac{a}{\sqrt{3}}

Thank you!


Question regarding NLP:

I believe I have sensible initialization of my layers, but since I am using a pre-trained embedding as a first layer, how can I control the standard deviation after that step?

I have checked and the mean activation is close to 0, but standard deviation after embeddings (for a given text) is a bit over 3!

(The rest of the layers have nicer std values, ranging from 0.35 after some convolutions to close to 1 for the final layers.)

Does anyone have any insights for this?

As I understand it, what we need is:

\prod_{l=2}^{L} \frac{1}{2} n_l Var[w_l] = 1

But the authors establish the following equation:

\frac{1}{2} n_l Var[w_l] = 1, \quad \forall l.

This works but it is not the only way to solve the equation (e.g. 0.8 * 1.25 is also 1). Did the authors choose to make every multiplication equal 1 because it was easier? Can we expect some difference in performance if we selectively initialize different layers to different scales while conserving the overall multiplication = 1?

It needs to be equal to 1 because this derivation should work for an arbitrary number of layers. The only way for the product to neither go to zero nor explode to infinity is for it to be 1. Since L can be 100 or even 1000 in theory, if it is anything but 1, it either collapses to a very small number or explodes to infinity.
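
To put numbers on this, here is a small sketch assuming every layer contributes the same factor \frac{1}{2} n_l Var[w_l] (the values 0.9 and 1.1 are arbitrary examples, not taken from the paper):

```python
L = 100  # number of layers, as in the example above

for factor in (0.9, 1.0, 1.1):
    # Product of L identical per-layer factors
    product = factor ** L
    print(f"factor {factor}: product over {L} layers = {product:.3e}")

shrunk = 0.9 ** L  # collapses towards zero
grown = 1.1 ** L   # blows up
```

Even a 10% deviation per layer changes the overall scale by several orders of magnitude once it is repeated 100 times.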


Another nice way to get some intuition about what @maxim.pechyonkin is saying is to look at the ‘All you need is a good init’ paper https://arxiv.org/abs/1511.06422 which takes an empirical approach to initialization. Instead of deriving formulas for the initializations of the weights in terms of the parameters of the network architecture, the authors just determine the appropriate scale for the weights by experiment. They feed a batch of inputs through the network layer by layer and scale the weights of each layer to ensure the output always has variance close to 1. The appeal of this approach is that it means you don’t continually have to think about different rules for initialization as you develop new architectures, you can just set them algorithmically.
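
Here is a rough sketch of that idea, not the paper’s exact procedure. It assumes bias-free linear layers followed by ReLU, starts from plain Gaussian weights, and does a single rescale per layer; the helper names (`linear`, `relu`, `output_variance`, `scale_to_unit_variance`) and the toy sizes are made up for illustration, and it uses plain Python lists rather than PyTorch tensors.

```python
import random
import statistics

def linear(weights, batch):
    """Apply a bias-free linear layer (each row of `weights` is one output unit)."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights] for vec in batch]

def relu(batch):
    return [[max(0.0, x) for x in vec] for vec in batch]

def output_variance(weights, batch):
    """Variance of all pre-activation outputs of one layer over the batch."""
    return statistics.pvariance([v for vec in linear(weights, batch) for v in vec])

def scale_to_unit_variance(layers, batch):
    """Rescale each layer in place, in order, so its outputs on `batch` have variance ~1."""
    acts = batch
    for weights in layers:
        var = output_variance(weights, acts)
        if var > 0.0:
            scale = var ** 0.5
            for row in weights:
                row[:] = [w / scale for w in row]
        # Feed the rescaled layer's activations to the next layer
        acts = relu(linear(weights, acts))
    return layers

random.seed(0)
width, depth, batch_size = 16, 4, 64  # toy sizes, for illustration only
layers = [[[random.gauss(0.0, 1.0) for _ in range(width)] for _ in range(width)]
          for _ in range(depth)]
batch = [[random.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch_size)]

scale_to_unit_variance(layers, batch)

# Re-run the forward pass and record each layer's output variance
variances = []
acts = batch
for weights in layers:
    variances.append(output_variance(weights, acts))
    acts = relu(linear(weights, acts))
print(variances)
```

Because each layer here is linear in its weights, one rescale is enough to hit unit variance on this batch; with other layer types you would probably need to iterate.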

I’ve also attempted an implementation of it here: https://forums.fast.ai/t/implementing-the-empirical-initialization-from-all-you-need-is-a-good-init/42284


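I don’t think I understand this. Let’s say my initialization scheme is Var(w_1) = \frac{2}{n_1} \cdot 1.25, Var(w_2) = \frac{2}{n_2} \cdot 0.8, ..., Var(w_{t-1}) = \frac{2}{n_{t-1}} \cdot 1.25, Var(w_t) = \frac{2}{n_t} \cdot 0.8, so that \frac{1}{2} n_l Var(w_l) alternates between 1.25 and 0.8.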

This extends easily to an arbitrary number of layers, and it still holds that:

\prod_{l=1}^{L} \frac{1}{2} n_l Var[w_l] = 1

for any L.

I don’t know if this would affect performance but I want to understand if this is still valid.

EDIT: I guess the difference between my reasoning and the authors’ is that the authors try to get every layer’s variance to be close to 1, whereas I am only asking that no layer be very far away from 1 (we can deviate a little bit, but we have to come back to 1 soon, lest the output’s variance be too large or too small).

Yes, this is still valid, at least as far as the argument in the paper is concerned. There are many other valid configurations (alternating between 0.5 and 2 is another such configuration). That’s why they say in the paper that

A sufficient condition is:

\frac{1}{2}n_lVar[w_l] = 1, \quad \forall l.

(emphasis mine, as opposed to a necessary condition). So the answer to

Did the authors choose to make every multiplication equal 1 because it was easier?

is yes.
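
As a small illustration of why the alternating scheme stays under control, the sketch below tracks the running product of the per-layer factors \frac{1}{2} n_l Var[w_l], assuming they alternate between 1.25 and 0.8 as proposed above (the depth of 100 is an arbitrary choice):

```python
factors = [1.25, 0.8] * 50  # per-layer values of (1/2) n_l Var[w_l], 100 layers

running = []
product = 1.0
for f in factors:
    product *= f
    running.append(product)  # overall scale after each layer

# The running product never drifts far from 1, however deep the network is
print(min(running), max(running))
```

The product is only back at 1 after every second layer, and sits at 1.25 in between, which is the “deviate a little bit but come back soon” behaviour described in the EDIT above.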


Is that so bad, if your standard deviations don’t get out of control later on (and actually end up closer to 1 in the end)? I’m wondering if you just noticed this somehow or if this is actually negatively impacting your training/results.

If you were really dead set on it, you could scale down your embedding by 3, and scale the next layer back up by 3.

The embedding vectors are stored in a matrix, so you could just divide that matrix by 3?
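
A toy sketch of that rescaling, assuming the layer right after the embedding is linear with no bias (with a bias you would scale only the weights). The sizes, the names `embedding`, `weights`, and `next_layer`, and the random values are all invented for illustration; this is plain Python, not the fastai or PyTorch API.

```python
import random

random.seed(0)
vocab, emb_dim, out_dim = 10, 4, 3  # toy sizes, for illustration only

# Stand-ins for a pre-trained embedding (std around 3) and the next layer's weights
embedding = [[random.gauss(0.0, 3.0) for _ in range(emb_dim)] for _ in range(vocab)]
weights = [[random.gauss(0.0, 0.5) for _ in range(emb_dim)] for _ in range(out_dim)]

def next_layer(w, vec):
    """Bias-free linear layer applied to one embedding vector."""
    return [sum(wi * xi for wi, xi in zip(row, vec)) for row in w]

k = 3.0
embedding_scaled = [[x / k for x in row] for row in embedding]  # std drops by a factor of 3
weights_scaled = [[w * k for w in row] for row in weights]      # next layer compensates

token = 7
before = next_layer(weights, embedding[token])
after = next_layer(weights_scaled, embedding_scaled[token])
print(before, after)  # the two outputs should agree up to floating-point rounding
```

The activations coming out of the embedding are now about 3 times smaller, while everything downstream of the next layer sees the same values as before.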

Hey, I published my own review on the Kaiming paper, without the derivation but with all the important intuitive concepts. I hope it serves as a complement of @PierreO’s post. You can find it here. Feedback is appreciated!


Hi @Kaspar and @mediocrates

I suspect you are both right. Multiplying the embedding matrix like that should be fine and, on the other hand, I am not sure yet this is having a measurable negative impact as it is. I was wondering about deviations in my model after watching lesson 8, that’s why I looked into it, but it may be fine as it is already (hopefully not: improvements are always welcome!)

I think I will probably try to “normalize” my embedding matrix and see how it goes. It’ll probably be inconclusive, but I’ll let you know if it is an unprecedented breakthrough :sweat_smile:

Hi @cqfd, were the Statistics 110 lectures enough to understand the maths behind the Kaiming He initialisation paper and Pierre’s blog post? At the moment I am really struggling, as I don’t know about expectations, variances, covariances, etc.

Yep, Blitzstein definitely covers all those things, and in my opinion in a nice way too :slight_smile:


Awesome, thanks! :slight_smile:

Before diving into many hours of stats lectures I’d suggest checking out Khan Academy on statistics. If you’re starting from “What the hell is covariance?” (like me!) it will get you up to speed very quickly. To be fair, it only covers these concepts in scalar form, not the matrix stats used in the Kaiming paper, but at that point you’ll have the idea and you can decide how much further you want to take it.