Why are we trying to make mean 0 and standard deviation 1 for activation layers?

ShaanShah · July 1, 2020, 1:39pm

Could someone please explain why we are trying to make the mean 0 and the standard deviation 1 for activations (through Kaiming init and other methods)?

marii · July 1, 2020, 11:53pm

It helps against exploding/vanishing activations. Activations with mean 0 and std 1 can pass through many layers without this happening (it is a math property of repeated multiplication).

There are a few issues with the std getting close to 0, such as more underflow, especially in fp16.
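As a rough illustration of the fp16 underflow point, here is a small sketch using only the Python standard library. The helper `to_fp16` and the "shrinks by 10x per layer" factor are made up for this example; the `struct` format `"e"` is IEEE 754 half precision.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Hypothetical std that shrinks by 10x at every layer (illustrative factor only).
for n_layers in (1, 4, 5, 7, 8):
    std = 0.1 ** n_layers
    print(f"{n_layers} layers: float64 = {std:.3g}, fp16 = {to_fp16(std):.3g}")
```

The smallest positive fp16 value is about 6e-8, so the 8-layer value (1e-8) rounds to exactly 0.0 in fp16 while it is still perfectly representable in float64.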

Without the activations being effectively “normalized”, you lose the property that no single weight contributes too much to the final result.

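Mean 0 is mostly there because 0 x 0 = 0 and 0 + 0 = 0: multiplying and adding things centered on 0 keeps them centered on 0.
Std 1 is there because 1 x 1 = 1: multiplying things with scale 1 keeps the scale at 1.
These do not hold exactly over many layers, because the weights are randomly sampled and small variations compound until the property breaks.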
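A minimal sketch of the single-layer case, assuming a plain linear layer with no bias and no nonlinearity, and weights drawn with std `1 / sqrt(n_in)` (my choice for the example; Kaiming init uses `sqrt(2 / n_in)` to compensate for ReLU). The names `linear`, `n_in` and `n_out` are just for illustration.

```python
import math
import random
import statistics

random.seed(0)
n_in, n_out = 256, 256

# Input activations: mean 0, std 1.
x = [random.gauss(0, 1) for _ in range(n_in)]

def linear(x, weight_std):
    """One linear layer (no bias, no nonlinearity) with freshly sampled N(0, weight_std^2) weights."""
    return [sum(random.gauss(0, weight_std) * xi for xi in x) for _ in range(n_out)]

y = linear(x, weight_std=1 / math.sqrt(n_in))
print("output mean:", statistics.mean(y))
print("output std: ", statistics.pstdev(y))
```

The output should come out close to mean 0, std 1, but not exactly, and that small per-layer error is what compounds when many layers are stacked.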

If this is still confusing, I wouldn’t worry about it too much. It takes a while for all of these “important” things to click at the practical level, which is why I really like Jeremy’s practical-first approach.

ShaanShah · July 2, 2020, 6:36am

Hey!
Thanks for the explanation, but could you please explain a little more what you mean by activation explosion/vanishing? (I saw what was happening in the lecture, but I didn’t fully understand it.)

marii · July 2, 2020, 11:50am

Sure,

Explosion comes from repeatedly multiplying by things greater than 1: if each of 5 layers multiplies by 10, you get 10^5 = 100000. This applies to the standard deviation as well, i.e., if the thing I’m being multiplied by most often is greater than 1, things explode, and I eventually become infinity!

Vanishing comes from repeatedly multiplying by things less than 1: 0.1^5 = 0.00001. The same goes for the standard deviation, i.e., if the thing I’m being multiplied by most often is less than 1, I vanish, and I eventually become 0.

Both 0 and infinity are meaningless here: infinity because we have lost the data that was too big to represent, and 0 because anything with a std of 0 can only equal its mean, i.e., if the mean is 0, it can only be all 0s. You eventually run into numbers too small to be represented properly, just as you run into numbers that are too big.
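The same arithmetic can be sketched with a stack of random linear layers. This is only a toy under my own assumptions (no nonlinearity, no bias, 10 layers of width 128, and a made-up `gain` knob that scales the weight std relative to `1 / sqrt(n)`), not what any particular library does.

```python
import math
import random
import statistics

random.seed(0)
n, depth = 128, 10

def final_std(gain):
    """Push a mean-0, std-1 vector through `depth` random linear layers; return the output std."""
    x = [random.gauss(0, 1) for _ in range(n)]
    w_std = gain / math.sqrt(n)  # gain = 1 roughly preserves the std at each layer
    for _ in range(depth):
        x = [sum(random.gauss(0, w_std) * xi for xi in x) for _ in range(n)]
    return statistics.pstdev(x)

results = {gain: final_std(gain) for gain in (0.5, 1.0, 2.0)}
for gain, std in results.items():
    print(f"gain {gain}: std after {depth} layers = {std:.4g}")
```

Each layer multiplies the std by roughly `gain`, so after 10 layers you are looking at about 0.5^10 (around 0.001) on one side and 2^10 (around 1000) on the other, while gain 1 stays near 1.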

It is still useful to look into why we normalize data, and what non-normalized data means. We are effectively normalizing the data when we guarantee that the data is coming in with mean 0 and std 1.

We can get into the actual gradients as well, but they are related directly to this behavior.

ShaanShah · July 2, 2020, 12:29pm

OK, all right, got it now.
Thanks a lot!

arnau · September 19, 2020, 1:38pm

I think it is also important to point out that this problem of activation explosion/vanishing only concerns the first iterations of training. BatchNorm actually learns the “optimal” mean and std for each layer, so in later stages of training, and probably once the model is fully trained, the distribution of the weights in a given layer will differ from mean=0, std=1. Correct me if I am wrong, I am not an expert on this.

marii · September 19, 2020, 10:55pm

Batchnorm works on the activations, not the weights, but besides that it sounds mostly right.

It isn’t that the activations shouldn’t be mean 0, std 1, though. The exponential effect above still exists, and even the data we pass into the model isn’t always going to have exactly mean=0, std=1: we normalize the whole dataset, not each batch.
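To make the dataset-versus-batch point concrete, here is a small sketch with made-up data (uniform values in 0..255 standing in for something like pixel intensities; the batch size of 32 is also just an assumption).

```python
import random
import statistics

random.seed(0)

# Hypothetical raw feature values, invented for illustration.
raw = [random.uniform(0, 255) for _ in range(3200)]

# Normalize with the statistics of the WHOLE dataset.
mu, sigma = statistics.mean(raw), statistics.pstdev(raw)
data = [(v - mu) / sigma for v in raw]
print("dataset mean/std:", statistics.mean(data), statistics.pstdev(data))

# Individual batches are only approximately mean 0, std 1.
batch_size = 32
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
batch_means = [statistics.mean(b) for b in batches]
batch_stds = [statistics.pstdev(b) for b in batches]
print("batch means range:", min(batch_means), max(batch_means))
print("batch stds range: ", min(batch_stds), max(batch_stds))
```

The dataset as a whole is mean 0, std 1 up to floating point error, but each batch of 32 lands a little off in both mean and std.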

On average we should have mean 0, std 1, but these values can vary a little bit. Take these small variations and apply the exponential effect above, and we get the same problem as before. Batch norm forces mean 0, std 1, so that these small variations don’t cause our model to explode over many iterations.
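For reference, a minimal sketch of the normalization step for a single feature in training mode, written in plain Python rather than with any framework. The function name `batch_norm` and the example batch (drawn around mean 0.3, std 1.5) are my own; `gamma` and `beta` stand for the learnable scale and shift, shown here at their usual starting values of 1 and 0.

```python
import math
import random
import statistics

random.seed(0)

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then apply the learnable scale and shift."""
    mean = statistics.mean(batch)
    var = statistics.pvariance(batch)  # biased (population) variance of the batch
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

# A batch whose activations have drifted away from mean 0, std 1.
batch = [random.gauss(0.3, 1.5) for _ in range(64)]
out = batch_norm(batch)
print("before:", statistics.mean(batch), statistics.pstdev(batch))
print("after: ", statistics.mean(out), statistics.pstdev(out))
```

With `gamma=1` and `beta=0` the output is back at mean 0, std 1. Once training moves `gamma` and `beta`, the layer's output distribution no longer has to be exactly mean 0, std 1.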

Getting mean 0, std 1 without batch norm is still important, though. You want what is passed into Batchnorm to have only “small” deviations from mean 0, std 1; otherwise you can lose information when you normalize.