Why do we need to multiply by 1 over the square root of the variance in order to bring the variance down to 1?

mlnoob · October 25, 2019, 3:56am

First, look at this example:

>>> import torch as t
>>> x = t.randn(512)
>>> w = t.randn(512, 500000)
>>> (x @ w).var()
tensor(513.9548)

It makes sense that the variance is close to 512, because each of the 500000 outputs is a dot product of two 512-element vectors whose entries are sampled from a distribution with mean 0 and standard deviation 1.
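
The reasoning above can be written out as a short derivation. This is a sketch under the assumption that the entries of w are independent with mean 0 and variance 1; the symbols y_j, x_i, w_{ij} and n are introduced here for illustration and do not appear in the original post.

```latex
% One output of x @ w, for a fixed input vector x of length n:
y_j = \sum_{i=1}^{n} x_i \, w_{ij}

% The w_{ij} are independent with variance 1, so the variances add:
\operatorname{Var}(y_j \mid x)
  = \sum_{i=1}^{n} x_i^2 \, \operatorname{Var}(w_{ij})
  = \sum_{i=1}^{n} x_i^2
  = \lVert x \rVert^2

% The x_i are also drawn with mean 0 and variance 1, so on average:
\mathbb{E}\!\left[ \lVert x \rVert^2 \right] = n = 512
```

Because x is a single fixed sample, the measured variance lands near 512 rather than exactly on it.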

However, I wanted the variance to go down to 1 (and consequently the standard deviation as well, since the standard deviation is the square root of the variance).

To do this, I tried the following:

>>> x = t.randn(512)
>>> w = t.randn(512, 500000) * (1/512)
>>> (x @ w).var()
tensor(0.0021)

However, the variance is now 512 / 512 / 512 (about 0.002) instead of 512 / 512.

To do this correctly, I needed to use:

>>> x = t.randn(512)
>>> w = t.randn(512, 500000) * (1 / (512 ** .5))
>>> (x @ w).var()
tensor(1.0216)
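
The three experiments above can be condensed into one small helper. This is only a sketch in plain Python (standard library, no PyTorch) so the arithmetic stays explicit; the function name `output_variance`, its parameters, and the smaller sizes n = 64 and m = 2000 are assumptions made for illustration, not part of the original post.

```python
import random
import statistics


def output_variance(n, m, scale, seed=0):
    """Sample variance of the m outputs of x @ (scale * w), with x, w ~ N(0, 1)."""
    rng = random.Random(seed)  # same seed -> same x and w for every scale
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    outputs = []
    for _ in range(m):
        # One scaled column of w, dotted with x.
        column = [scale * rng.gauss(0.0, 1.0) for _ in range(n)]
        outputs.append(sum(xi * wi for xi, wi in zip(x, column)))
    return statistics.variance(outputs)


n, m = 64, 2000
unscaled = output_variance(n, m, 1.0)             # roughly n
by_n = output_variance(n, m, 1.0 / n)             # unscaled / n**2
by_sqrt_n = output_variance(n, m, 1.0 / n ** 0.5)  # unscaled / n
print(unscaled, by_n, by_sqrt_n)
```

Since the same seed is reused, the three results differ only by the scale factor: scaling by 1/n divides the variance by n squared, while scaling by 1/sqrt(n) divides it by n.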

Why is that the case?

MicPie · October 25, 2019, 5:41am

A very good explanation can be found here: http://cs231n.github.io/neural-networks-2/
(scroll down to “Calibrating the variances with 1/sqrt(n)”).

mlnoob · October 25, 2019, 6:26am

Yup, it's in the paper, thanks! But could you show me a code example that proves this sentence?

MicPie · October 25, 2019, 7:05am

Do you mean the first part?

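import torch

X = torch.randn(32, 64)
a = torch.tensor(4.)

# Scaling X by a scales the variance by a**2.
assert torch.allclose((a * X).var(), a**2 * X.var())
# Scaling X by sqrt(a) scales the variance by a.
assert torch.allclose((a.sqrt() * X).var(), a * X.var())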

mlnoob · October 25, 2019, 7:23am

Ah, yes. So if I understood correctly: if we want to scale the variance of something by a scalar a (which for us is 1/n), we should multiply by a ** .5, which is (1/n) ** .5, because multiplying X by a inside the variance function really multiplies the variance of X by a ** 2. So multiplying X by a ** .5 multiplies its variance by just a.

Meaning:

Var(((1/n) ** .5) * X) is really just (1/n) * Var(X)
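
For reference, the scaling property used here follows directly from the definition of variance. The derivation below is a sketch for a constant a and a random variable X; the numbers plug in n = 512 from the first post.

```latex
% Scaling a random variable by a constant a:
\operatorname{Var}(aX)
  = \mathbb{E}\!\left[ (aX - \mathbb{E}[aX])^2 \right]
  = \mathbb{E}\!\left[ a^2 (X - \mathbb{E}[X])^2 \right]
  = a^2 \operatorname{Var}(X)

% Scaling by 1/\sqrt{n} divides the variance by n:
\operatorname{Var}\!\left( \tfrac{1}{\sqrt{n}} X \right) = \tfrac{1}{n} \operatorname{Var}(X),
\qquad \tfrac{512}{512} = 1

% Scaling by 1/n divides the variance by n^2:
\operatorname{Var}\!\left( \tfrac{1}{n} X \right) = \tfrac{1}{n^2} \operatorname{Var}(X),
\qquad \tfrac{512}{512^2} \approx 0.00195
```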

MicPie · October 25, 2019, 7:49am

Inside means that a is applied to X before calculating the variance.

I also added the other example to the code above (which in the end is similar to the code you posted).

The Wikipedia article on variance also has some nice additional information, especially the properties that are used in the cs231n explanation.

Thanks for the question, it made me go through this important stuff again about initialization!

mlnoob · October 25, 2019, 8:02am

Yes indeed!

Also, no problem, although I should really be the grateful one!