LSUV Improvement?

In the 07a_lsuv.ipynb notebook (lesson 11), the LSUV initialization technique is implemented in two loops, the first for the mean and the second for the standard deviation, like this:

while mdl(xb) is not None and abs(h.mean)  > 1e-3: m.bias -= h.mean
while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std

When the mean and std of each layer are examined afterwards, the std is very close to 1 but the means aren’t quite 0. For the example network in the notebook, the (mean, std) pairs look like this:

(0.3387000262737274, 0.9999998807907104)
(0.0426153801381588, 1.0)
(0.18416695296764374, 1.0)
(0.17540690302848816, 0.9999998807907104)
(0.313778281211853, 1.0)

Jeremy suggests this is because the means are corrected before the stds: dividing the weights by the std afterwards shifts the mean again, and nothing re-centres it.
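
To make that ordering effect concrete, here is a minimal sketch on a toy layer. It is not the notebook code: `ToyLayer` is a hypothetical stand-in with a scalar weight and bias and no activation (out = weight * x + bias), and the input values are made up. It only assumes the same "mean loop, then std loop" structure as above.

```python
from statistics import fmean, pstdev

class ToyLayer:
    """Hypothetical stand-in for one layer: out = weight * x + bias."""
    def __init__(self, weight, bias):
        self.weight, self.bias = weight, bias

    def stats(self, xs):
        out = [self.weight * x + self.bias for x in xs]
        return fmean(out), pstdev(out)

def mean_then_std(layer, xs, tol=1e-3, max_iter=100):
    # First loop: shift the bias until the output mean is within tol of 0.
    for _ in range(max_iter):
        mean, _ = layer.stats(xs)
        if abs(mean) <= tol:
            break
        layer.bias -= mean
    # Second loop: rescale the weight until the output std is within tol of 1.
    # This changes weight * mean(xs), so the output mean moves away from 0.
    for _ in range(max_iter):
        _, std = layer.stats(xs)
        if abs(std - 1) <= tol:
            break
        layer.weight /= std
    return layer.stats(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up inputs with a non-zero mean
mean, std = mean_then_std(ToyLayer(weight=2.0, bias=0.5), xs)
print(mean, std)
```

In this toy case the std ends up at 1 but the mean is left well away from 0, because the bias was fitted to the old weight. The real network has non-linearities, so the size of the drift there will differ.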

If instead the mean and the variance are calculated together, in the same loop, then the mean does end up much closer to zero:

while mdl(xb) is not None and (abs(h.mean) > 1e-3 or abs(h.std-1) > 1e-3):
    m.bias -= h.mean
    m.weight.data /= h.std

Doing both in the same loop gives values such as these:

(9.123160005231057e-10, 1.0)
(9.647741450180547e-08, 1.0000001192092896)
(1.3737007975578308e-08, 1.0)
(7.55535438656807e-08, 1.0)
(-1.862645149230957e-08, 1.0)

However I’ve absolutely no idea if there’s any benefit to getting closer to zero for the mean!

I’ve also played a bit with that awesome tweak, and figured out that you can run the original two loops twice in a row and also get near-zero means.

while mdl(xb) is not None and abs(h.mean) > 1e-3: m.bias -= h.mean
while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
while mdl(xb) is not None and abs(h.mean) > 1e-3: m.bias -= h.mean
while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std

results in:

(3.679674520640219e-08, 1.0)
(3.466800535534276e-08, 1.0000001192092896)
(1.7229467630386353e-08, 1.0)
(3.818422555923462e-08, 0.9999999403953552)
(-8.195638656616211e-08, 1.0)

I guess there is next to no computational overhead, as we only init the model once.

The way you’ve refactored it may have a small flaw: it will still subtract the mean when abs(mean) < 1e-3 but abs(std-1) > 1e-3, and vice versa (it will still divide by the std when only the mean is out of tolerance).
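
One way to avoid that would be to keep the single loop but apply each correction only when its own tolerance is exceeded. The following is a sketch under assumptions, not the notebook code: `ToyLayer` is a hypothetical scalar layer (out = weight * x + bias, no activation), `lsuv_guarded` is a made-up name, and the inputs are invented.

```python
from statistics import fmean, pstdev

class ToyLayer:
    """Hypothetical stand-in for one layer: out = weight * x + bias."""
    def __init__(self, weight, bias):
        self.weight, self.bias = weight, bias

    def stats(self, xs):
        out = [self.weight * x + self.bias for x in xs]
        return fmean(out), pstdev(out)

def lsuv_guarded(layer, xs, tol=1e-3, max_iter=100):
    for _ in range(max_iter):
        mean, std = layer.stats(xs)
        mean_off = abs(mean) > tol
        std_off = abs(std - 1) > tol
        if not (mean_off or std_off):
            break  # both statistics are within tolerance
        if mean_off:
            layer.bias -= mean    # only touch the bias if the mean is off
        if std_off:
            layer.weight /= std   # only touch the weight if the std is off
    return layer.stats(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up inputs
mean, std = lsuv_guarded(ToyLayer(weight=2.0, bias=0.5), xs)
print(mean, std)
```

The `max_iter` cap is an addition for safety in the sketch; the loops quoted in this thread have no such limit.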

Or if you just switch the order of the normalization, and normalize the std first and then the mean, you also get means close to zero and stds close to one:

    # while mdl(xb) is not None and abs(h.mean)  > 1e-3: m.bias -= h.mean
    while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
    while mdl(xb) is not None and abs(h.mean)  > 1e-3: m.bias -= h.mean

(-1.9462740752373975e-08, 0.9999998807907104)
(-5.321842966310442e-10, 0.9999999403953552)
(6.05359673500061e-09, 1.0)
(-6.28642737865448e-09, 1.0)
(-2.421438694000244e-08, 0.9999998807907104)
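
A possible intuition for why std-first works, sketched on a toy linear layer (hypothetical, not the notebook code; the function names and inputs are made up): shifting the bias moves every output by the same amount, so it cannot change the std that was fixed in the first loop. With a non-linearity after the layer this only holds approximately.

```python
from statistics import fmean, pstdev

def stats(weight, bias, xs):
    # Toy layer: out = weight * x + bias, no activation.
    out = [weight * x + bias for x in xs]
    return fmean(out), pstdev(out)

def std_then_mean(weight, bias, xs, tol=1e-3, max_iter=100):
    # Rescale the weight first so the output std is within tol of 1.
    for _ in range(max_iter):
        _, std = stats(weight, bias, xs)
        if abs(std - 1) <= tol:
            break
        weight /= std
    # Then shift the bias; a constant shift leaves the std untouched.
    for _ in range(max_iter):
        mean, _ = stats(weight, bias, xs)
        if abs(mean) <= tol:
            break
        bias -= mean
    return weight, bias

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up inputs
weight, bias = std_then_mean(2.0, 0.5, xs)
mean, std = stats(weight, bias, xs)
print(mean, std)
```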