Why do we need to multiply by 1 over the square root of the variance in order to bring the variance down to 1?

#1

First, look at this example:

>>> import torch as t
>>> x = t.randn(512)
>>> w = t.randn(512, 500000)
>>> (x @ w).var()
tensor(513.9548)

It makes sense that the variance is close to 512, because each of the 500000 outputs is a dot product of two 512-element vectors whose entries are sampled from a distribution with mean 0 and standard deviation 1. Each of the 512 products has variance 1, and the variances of independent terms add up.
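The same effect can be checked for other vector sizes. What follows is only a rough sketch in plain Python (standard library, not the `t.randn` call used above); the helper name `dot_product_variance`, the sample count, and the seed are assumptions chosen for illustration. Unlike the example above, it redraws both vectors for every sample.

```python
import random
import statistics

def dot_product_variance(n, samples=2000, seed=0):
    """Estimate Var(x . w) where x and w each hold n i.i.d. N(0, 1) entries.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        # One dot product of two freshly drawn n-element vectors
        dots.append(sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(n)))
    return statistics.variance(dots)

for n in (16, 64, 256):
    # Each estimate should land near n, up to sampling noise
    print(n, round(dot_product_variance(n), 1))
```

The estimates are noisy, so they will not match n exactly, but they should grow roughly in proportion to the vector length.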

However, I wanted the variance to go down to 1 (and consequently the standard deviation to be 1 as well, since the standard deviation is the square root of the variance).

To do this, I tried the following:

>>> x = t.randn(512)
>>> w = t.randn(512, 500000) * (1/512)
>>> (x @ w).var()
tensor(0.0021)

However, the variance is now 512 / 512 / 512 (about 0.002) instead of 512 / 512 = 1.

To get this right, I had to use:

>>> x = t.randn(512)
>>> w = t.randn(512, 500000) * (1 / (512 ** .5))
>>> (x @ w).var()
tensor(1.0216)

Why is that the case?


(Michael) #2

A very good explanation can be found here: http://cs231n.github.io/neural-networks-2/
(scroll down to “Calibrating the variances with 1/sqrt(n)”).
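In brief, the argument there runs roughly as follows. This is a paraphrase, and it assumes the weights w_i and inputs x_i are independent, zero-mean, and identically distributed; the symbols s and a are introduced here only for illustration.

```latex
% Output of one neuron: s = \sum_{i=1}^{n} w_i x_i
\begin{aligned}
\operatorname{Var}(s)
  &= \sum_{i=1}^{n} \operatorname{Var}(w_i x_i)
   = \sum_{i=1}^{n} \operatorname{Var}(w_i)\,\operatorname{Var}(x_i)
   = n \,\operatorname{Var}(w)\,\operatorname{Var}(x) \\
% Scaling every weight by a constant a:
\operatorname{Var}\Big(\sum_{i=1}^{n} (a\,w_i)\, x_i\Big)
  &= n\, a^{2}\, \operatorname{Var}(w)\,\operatorname{Var}(x) \\
% To cancel the factor n we need n a^2 = 1:
a &= \frac{1}{\sqrt{n}}
\end{aligned}
```

With n = 512 and unit variances, a = 1/512 gives 512 / 512^2, while a = 1/sqrt(512) gives 512 / 512 = 1, which matches the numbers in the first post.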


#3

Yup, it's in the paper, thanks! But could you show me a code example that proves this sentence?


(Michael) #4

Do you mean the first part?

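import torch

X = torch.randn(32, 64)
a = torch.tensor(4.)

# Scaling X by a scales its variance by a**2
assert torch.allclose((a * X).var(), a**2 * X.var())
# Scaling X by sqrt(a) scales its variance by a
assert torch.allclose((a.sqrt() * X).var(), a * X.var())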

#5

Ah, yes. So if I understood correctly: if we want to scale the variance of something by a scalar a (which for us is 1/n), we should multiply by a ** .5, which is (1/n) ** .5. That is because a factor a inside the variance function really multiplies the variance of X by a ** 2, so multiplying X by a ** .5 multiplies the variance by just a.

Meaning:

Var(((1/n) ** .5) * X) is really just (1/n) * Var(X)
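As a quick sanity check of this, here is a minimal sketch in plain Python (standard library only, so it does not depend on torch). The variable names, the sample size of 10,000, and the seed are assumptions made just for illustration.

```python
import random
import statistics

rng = random.Random(0)
n = 512
X = [rng.gauss(0, 1) for _ in range(10_000)]

var_x = statistics.variance(X)
# Multiply every element by 1/n
var_scaled_by_inv_n = statistics.variance([v * (1 / n) for v in X])
# Multiply every element by (1/n) ** .5
var_scaled_by_inv_sqrt_n = statistics.variance([v * (1 / n) ** 0.5 for v in X])

# Scaling by 1/n shrinks the variance by 1/n**2, which is too much
print(var_scaled_by_inv_n / var_x, 1 / n**2)
# Scaling by (1/n) ** .5 shrinks it by 1/n, up to floating-point rounding
print(var_scaled_by_inv_sqrt_n / var_x, 1 / n)
```

Each printed pair should agree to many decimal places, since the scaling relation holds for any fixed sample and not only in expectation.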

(Michael) #6

Inside means that a is applied to X before calculating the variance.

I also added the other example to the code above (which in the end is similar to the code you posted).

The Wikipedia article on variance also has some nice additional information, especially the properties that are used in the cs231n explanation.

Thanks for the question, it made me go through this important stuff about initialization again! :)


#7

Yes indeed!

Also, no problem :) although I should really be the grateful one!
