Why are we trying to make mean 0 and standard deviation 1 for activation layers?

Could someone please explain why we are trying to make the mean 0 and the standard deviation 1 for activations (through Kaiming init and other methods)?

It helps prevent activations from exploding or vanishing. With mean 0 and standard deviation 1, activations can pass through many layers without this happening (a mathematical property).

There are a few issues with the STD getting close to 0, such as more underflow, especially in fp16.
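To make the fp16 underflow point concrete, here is a minimal sketch using only the Python standard library. The helper name `to_fp16` is made up for this illustration; it round-trips a float through the half-precision `e` format of the `struct` module to show what survives.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest half-precision (binary16) value.

    Hypothetical helper for illustration: struct's 'e' format code packs
    a float into 2 bytes, and unpacking reads it back as a Python float.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Values shrinking toward 0, as activations with a tiny std would.
for value in (1e-3, 1e-5, 1e-7, 1e-8):
    print(f"{value:g} -> fp16 {to_fp16(value):g}")
```

The smaller values lose precision first, and the last one is below what fp16 can represent at all, so it comes back as 0.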

Without the activations effectively being “normalized”, you lose the property that no single weight contributes too much to the final result.

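Mean 0 is mostly there because 0 × 0 = 0 and 0 + 0 = 0, so the mean stays at 0.
Std = 1 is there because 1 × 1 = 1, so the scale stays at 1.
These do not hold exactly over many layers, because the weights are randomly sampled and small variations are enough to break them.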
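A rough way to see the "1 × 1 = 1" idea is to push std-1 inputs through a single random linear layer and measure the std of the outputs. This is only a sketch under assumptions: a plain linear layer with no bias and no nonlinearity, a made-up helper `layer_output_std`, and weights scaled by 1/sqrt(fan_in). Kaiming init uses sqrt(2/fan_in) instead, to compensate for ReLU.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

fan_in, fan_out = 256, 256

# Inputs drawn with mean 0 and std 1.
x = [random.gauss(0.0, 1.0) for _ in range(fan_in)]


def layer_output_std(weight_std: float) -> float:
    """Std of the outputs of one random linear layer (no bias, no nonlinearity)."""
    outputs = []
    for _ in range(fan_out):
        # One row of zero-mean random weights per output unit.
        weights = [random.gauss(0.0, weight_std) for _ in range(fan_in)]
        outputs.append(sum(w * xi for w, xi in zip(weights, x)))
    return statistics.pstdev(outputs)


print("weights std 1:             ", layer_output_std(1.0))
print("weights std 1/sqrt(fan_in):", layer_output_std(1.0 / math.sqrt(fan_in)))
```

With weights of std 1 the output std grows to roughly sqrt(fan_in), while the scaled weights keep it close to 1. It is only close, not exact, because everything is randomly sampled.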

If this is still confusing, I wouldn’t worry about it too much. It takes a while for all of these “important” things to click at the practical level, which is why I really like Jeremy’s approach of practical first.


Hey!
Thanks for the explanation, but could you please explain a little more about what you mean by activation explosion/vanishing? (I saw what is happening in the lecture, but I didn’t fully understand it.)


Explosion comes from repeatedly multiplying by things greater than one. If you have 5 layers and each one multiplies by 10, you get 10^5 = 100000. This applies to the standard deviation as well, i.e., the thing I’m being multiplied by the most is greater than 1, so things explode, or I eventually become infinity!

Vanishing comes from repeatedly multiplying by things lower than 1: 0.1^5 = 0.00001. The same happens with the standard deviation, i.e., the thing I’m being multiplied by the most is less than 1, so I vanish, or I eventually become 0.
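The same compounding can be sketched with a small stack of random linear layers. This is an illustration under assumptions, not how any particular library does it: the helper `std_after_layers`, the width of 64 and the depth of 10 are all made up for the example, and there are no nonlinearities. The "balanced" weight std is 1/sqrt(width); the other two are just twice and half of that.

```python
import math
import random
import statistics


def std_after_layers(weight_std: float, width: int = 64,
                     depth: int = 10, seed: int = 0) -> float:
    """Std of the activations after `depth` random linear layers (no nonlinearity)."""
    rng = random.Random(seed)
    # Start from inputs with mean 0 and std 1.
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit is a weighted sum of the previous activations,
        # with fresh zero-mean random weights.
        x = [sum(rng.gauss(0.0, weight_std) * xi for xi in x)
             for _ in range(width)]
    return statistics.pstdev(x)


width = 64
good = 1.0 / math.sqrt(width)  # keeps the std roughly at 1 per layer
for label, w_std in (("too large (2x)", 2 * good),
                     ("balanced", good),
                     ("too small (0.5x)", 0.5 * good)):
    std = std_after_layers(w_std, width)
    print(f"{label:>16}: std after 10 layers = {std:.3g}")
```

Doubling the weight scale multiplies the final std by about 2^10, and halving it divides by about 2^10, which is the 10^5 versus 0.1^5 story with different numbers.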

Both 0 and infinity above are meaningless: infinity because we have lost data that is too big to represent, and 0 because anything with a std of 0 can only be the mean, i.e., if the mean is 0, it can only be all 0s. You will eventually run into the number being too small to be properly represented, just like you run into it being too big.

It is still useful to look into why we normalize data, and what non-normalized data means. We are effectively normalizing the data when we guarantee that the data is coming in with mean 0 and std 1.

We can get into the actual gradients as well, but they are related directly to this behavior.


OK, all right, got it now.
Thanks a lot!