In the 07a_lsuv.ipynb (lesson 11) notebook, the LSUV initialization technique is implemented with two loops, the first for the mean and the second for the standard deviation, like this:
while mdl(xb) is not None and abs(h.mean) > 1e-3: m.bias -= h.mean
while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
When the mean and std of the layers are examined, the std is very close to 1 but the means aren’t quite 0. In the notebook, the example network gives values like this:
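Jeremy suggests this is because the means are adjusted before the variances. Rescaling the weights in the second loop changes the activations again, so the mean that the first loop had corrected can drift away from zero.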
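To convince myself of the ordering effect, here is a toy sketch in plain Python (not the notebook's code). It assumes a single made-up unit whose output is `relu(w * x) + b`, with the shift `b` applied after the ReLU; `xs`, `stats`, `w` and `b` are all hypothetical names for illustration, and a real network will not behave this cleanly.

```python
import random
import statistics

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for a batch xb

def stats(w, b):
    # Hypothetical one-unit layer (an assumption): out = relu(w * x) + b
    out = [max(w * x, 0.0) + b for x in xs]
    return statistics.mean(out), statistics.stdev(out)

w, b = 3.0, 0.0

# Loop 1: shift the output until its mean is ~0.
mean, std = stats(w, b)
while abs(mean) > 1e-3:
    b -= mean
    mean, std = stats(w, b)
mean_after_loop1 = mean

# Loop 2: rescale the weight until the std is ~1.
# This changes relu(w * x), so the mean fixed in loop 1 moves again.
while abs(std - 1) > 1e-3:
    w /= std
    mean, std = stats(w, b)

final_mean, final_std = mean, std
print(f"mean after loop 1: {mean_after_loop1:.4f}")
print(f"final mean: {final_mean:.4f}, final std: {final_std:.4f}")
```

In this toy, if m and s are the mean and std of `relu(w * x)` before rescaling, the first loop sets `b = -m` and the second divides the ReLU output by s, so the final mean is m/s - m, which is only zero if s was already 1.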
If instead the mean and the std are adjusted together, in the same loop, then the mean does end up much closer to zero:
while mdl(xb) is not None and (abs(h.mean) > 1e-3 or abs(h.std-1) > 1e-3):
    m.bias -= h.mean
    m.weight.data /= h.std
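The same idea can be tried on a toy problem in plain Python. This is only a sketch under assumptions: a single made-up unit with output `relu(w * x) + b` (shift applied after the ReLU), and hypothetical names `xs`, `stats`, `w`, `b` and `steps` that do not come from the notebook.

```python
import random
import statistics

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for a batch xb

def stats(w, b):
    # Hypothetical one-unit layer (an assumption): out = relu(w * x) + b
    out = [max(w * x, 0.0) + b for x in xs]
    return statistics.mean(out), statistics.stdev(out)

w, b = 3.0, 0.0
steps = 0
mean, std = stats(w, b)

# Single loop: keep going until BOTH statistics are within tolerance.
while abs(mean) > 1e-3 or abs(std - 1) > 1e-3:
    b -= mean   # shift using the current mean
    w /= std    # rescale using the current std
    mean, std = stats(w, b)
    steps += 1

print(f"steps: {steps}, mean: {mean:.4f}, std: {std:.4f}")
```

Because the loop condition checks both the mean and the std, a mean that was disturbed by the rescaling gets corrected on the next pass. The toy settles in a couple of passes; I would not assume a real network converges as quickly.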
Doing both in the same loop gives values such as these:
However, I’ve absolutely no idea if there’s any benefit to getting the mean closer to zero!