Regarding the function lsuv_module from the notebook 07a_lsuv:
def lsuv_module(m, xb):
    h = Hook(m, append_stat)
    # First: set the mean to zero
    while mdl(xb) is not None and abs(h.mean) > 1e-3: m.bias -= h.mean
    # Second: set the standard deviation to one
    while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
    h.remove()
    return h.mean, h.std
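For anyone reading this without the notebook open: lsuv_module depends on the Hook class and the append_stat callback defined earlier in the course notebooks, as well as on mdl (the full model) and xb (a single minibatch). From memory, and assuming the standard PyTorch forward-hook pattern rather than copying the notebook verbatim, they look roughly like this:

from functools import partial

class Hook():
    # Register a forward hook on module m; f is called as f(hook, module, input, output).
    def __init__(self, m, f): self.hook = m.register_forward_hook(partial(f, self))
    def remove(self): self.hook.remove()
    def __del__(self): self.remove()

def append_stat(hook, mod, inp, outp):
    # Record the mean and std of the module's output on the hook itself.
    d = outp.data
    hook.mean, hook.std = d.mean().item(), d.std().item()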
Jeremy finds the following results in his notebook after executing the following line of code:
for m in mods: print(lsuv_module(m, xb))
(0.17071205377578735, 1.0)
(0.08888687938451767, 1.0000001192092896)
(0.1499888300895691, 0.9999999403953552)
(0.15749432146549225, 1.0)
(0.3106708824634552, 1.0)
Here is the interesting part: if I swap the order of the two loops in lsuv_module and define the function like this:
def lsuv_module(m, xb):
    h = Hook(m, append_stat)
    # First: set the standard deviation to one
    while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
    # Second: set the mean to zero
    while mdl(xb) is not None and abs(h.mean) > 1e-3: m.bias -= h.mean
    h.remove()
    return h.mean, h.std
my results are the following:
(-4.66094616058399e-06, 1.0)
(2.151545231754426e-06, 1.0)
(2.430751919746399e-06, 1.0)
(1.562759280204773e-06, 1.0)
(2.0489096641540527e-07, 1.0000001192092896)
where the means are about six orders of magnitude closer to zero. I was just wondering whether you were aware of this: apparently the order in which we operate on the weights matters, and first normalising the standard deviation and then subtracting the mean is not the same as doing it the other way around, and the difference is clearly noticeable. I suspect the reason is that subtracting a constant from the bias never changes the standard deviation, whereas rescaling the weights does change the mean, so whichever statistic is adjusted last is the one that ends up exact.
I still do not know whether this improves training in the first steps, but since you emphasised so much that initialising the weights matters, I wanted to let you know about this difference.
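To illustrate this, here is a small self-contained toy, entirely my own sketch rather than code from the notebook. If I remember the notebook's ConvLayer correctly, its bias property is a shift applied after the activation (the sub of GeneralRelu), so the toy uses a ReLU followed by a learnable shift as a stand-in; the sizes, batch, and names (ShiftedRelu, fix_mean, fix_std, run) are all made up for the example:

import torch
import torch.nn as nn

torch.manual_seed(0)
xb = torch.randn(512, 100)   # a stand-in minibatch

class ShiftedRelu(nn.Module):
    # ReLU followed by a learnable shift, mimicking the post-activation `sub`
    # that the notebook's ConvLayer exposes as its `bias` property.
    def __init__(self):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(1))
    def forward(self, x): return torch.relu(x) + self.shift

def fix_mean(m, relu):
    # Shift the output until its mean is ~0; this never changes the std.
    with torch.no_grad():
        while abs(m(xb).mean()) > 1e-3: relu.shift -= m(xb).mean()

def fix_std(m, lin):
    # Rescale the weights until the output std is ~1; this also moves the mean.
    with torch.no_grad():
        while abs(m(xb).std() - 1) > 1e-3: lin.weight /= m(xb).std()

def run(order):
    # Fresh layer each time; apply the two fixes in the given order.
    lin, relu = nn.Linear(100, 100), ShiftedRelu()
    m = nn.Sequential(lin, relu)
    for step in order:
        if step == "mean": fix_mean(m, relu)
        else:              fix_std(m, lin)
    with torch.no_grad(): out = m(xb)
    print(order, out.mean().item(), out.std().item())

run(("mean", "std"))   # the mean drifts away from zero again (std is 1)
run(("std", "mean"))   # the mean stays ~0 and the std stays 1

This should reproduce the same qualitative pattern as the two sets of results above: whichever statistic is fixed last is the one that comes out essentially exact.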