Regarding the function lsuv_module from the notebook 07a_lsuv:
def lsuv_module(m, xb):
    h = Hook(m, append_stat)
    # First: set the mean to zero
    while mdl(xb) is not None and abs(h.mean) > 1e-3: m.bias -= h.mean
    # Second: set the standard deviation to one
    while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
    h.remove()
    return h.mean, h.std
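For anyone reading this without the notebook open: lsuv_module depends on the Hook class and the append_stat callback defined earlier in the course notebooks, as well as on mdl (the full model) and xb (a single minibatch). From memory, and assuming the standard PyTorch forward-hook pattern rather than copying the notebook verbatim, they look roughly like this:

from functools import partial

class Hook():
    # Register a forward hook on module m; f is called as f(hook, module, input, output).
    def __init__(self, m, f): self.hook = m.register_forward_hook(partial(f, self))
    def remove(self): self.hook.remove()
    def __del__(self): self.remove()

def append_stat(hook, mod, inp, outp):
    # Record the mean and std of the module's output on the hook itself.
    d = outp.data
    hook.mean, hook.std = d.mean().item(), d.std().item()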
Jeremy finds the following results in his notebook after executing the following line of code:
for m in mods: print(lsuv_module(m, xb))
(0.17071205377578735, 1.0)
(0.08888687938451767, 1.0000001192092896)
(0.1499888300895691, 0.9999999403953552)
(0.15749432146549225, 1.0)
(0.3106708824634552, 1.0)
Here is the interesting part: if I swap the order of the two loops in lsuv_module and define the function like this:
def lsuv_module(m, xb):
    h = Hook(m, append_stat)
    # First: set the standard deviation to one
    while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
    # Second: set the mean to zero
    while mdl(xb) is not None and abs(h.mean) > 1e-3: m.bias -= h.mean
    h.remove()
    return h.mean, h.std
my results are the following:
(-4.66094616058399e-06, 1.0)
(2.151545231754426e-06, 1.0)
(2.430751919746399e-06, 1.0)
(1.562759280204773e-06, 1.0)
(2.0489096641540527e-07, 1.0000001192092896)
where the means are about six orders of magnitude closer to zero. I was just wondering whether you were aware of this: apparently the order in which we operate on the weights matters, and first normalising the standard deviation and then subtracting the mean is not the same as doing it the other way around, and the difference is clearly noticeable. I suspect the reason is that subtracting a constant from the bias never changes the standard deviation, whereas rescaling the weights does change the mean, so whichever statistic is adjusted last is the one that ends up exact.
I still do not know whether this improves training in the first steps, but since you emphasised so much that initialising the weights matters, I wanted to let you know about this difference.
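To illustrate this, here is a small self-contained toy, entirely my own sketch rather than code from the notebook. If I remember the notebook's ConvLayer correctly, its bias property is a shift applied after the activation (the sub of GeneralRelu), so the toy uses a ReLU followed by a learnable shift as a stand-in; the sizes, batch, and names (ShiftedRelu, fix_mean, fix_std, run) are all made up for the example:

import torch
import torch.nn as nn

torch.manual_seed(0)
xb = torch.randn(512, 100)   # a stand-in minibatch

class ShiftedRelu(nn.Module):
    # ReLU followed by a learnable shift, mimicking the post-activation `sub`
    # that the notebook's ConvLayer exposes as its `bias` property.
    def __init__(self):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(1))
    def forward(self, x): return torch.relu(x) + self.shift

def fix_mean(m, relu):
    # Shift the output until its mean is ~0; this never changes the std.
    with torch.no_grad():
        while abs(m(xb).mean()) > 1e-3: relu.shift -= m(xb).mean()

def fix_std(m, lin):
    # Rescale the weights until the output std is ~1; this also moves the mean.
    with torch.no_grad():
        while abs(m(xb).std() - 1) > 1e-3: lin.weight /= m(xb).std()

def run(order):
    # Fresh layer each time; apply the two fixes in the given order.
    lin, relu = nn.Linear(100, 100), ShiftedRelu()
    m = nn.Sequential(lin, relu)
    for step in order:
        if step == "mean": fix_mean(m, relu)
        else:              fix_std(m, lin)
    with torch.no_grad(): out = m(xb)
    print(order, out.mean().item(), out.std().item())

run(("mean", "std"))   # the mean drifts away from zero again (std is 1)
run(("std", "mean"))   # the mean stays ~0 and the std stays 1

This should reproduce the same qualitative pattern as the two sets of results above: whichever statistic is fixed last is the one that comes out essentially exact.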