I played some more with the 07a_lsuv nb. Here are some observations/notes:
- The `sub` argument shouldn't be configurable, as it gets reset to a value relative to the batch's mean regardless of its initial value. (Unless it's meant to be used some other way, without lsuv; but then it'd be very difficult to choose manually, since with lsuv it varies from layer to layer.) To prove that it doesn't need to be configurable, fix the seed and re-run the nb once with `sub` set to 0 and then to 50, adding its value to the return list. After `lsuv_module` is run, `m.relu.sub` ends up being exactly the same value, regardless of its initial value:
```
class ConvLayer(nn.Module):
    def __init__(self, ni, nf, ks=3, stride=2, sub=50., **kwargs):
                                               ^^^^^^^
        [...]

def lsuv_module(m, xb):
    [...]
    return m.relu.sub, h.mean, h.std
           ^^^^^^^^^^
```
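  For reference, here is roughly how I ran that check. It leans on the lesson nb's `get_learn_run`, `get_batch`, `find_modules`, `lsuv_module`, `nfs`, `data`, `cbfs`, and the global `mdl` that `lsuv_module` reads; the loop and the seed value are my own, so treat it as a sketch (in practice I simply fixed the seed at the top of the nb and re-ran it twice):

  ```python
  import torch
  from functools import partial

  results = {}
  for init_sub in (0., 50.):            # the two initial values being compared
      torch.manual_seed(42)             # re-seed so only `sub` differs between runs (seed value is arbitrary)
      learn, run = get_learn_run(nfs, data, 0.6, partial(ConvLayer, sub=init_sub), cbs=cbfs)
      xb, yb = get_batch(data.train_dl, run)
      mdl = learn.model.cuda()          # lsuv_module reads this global
      mods = find_modules(learn.model, lambda o: isinstance(o, ConvLayer))
      # record what m.relu.sub gets reset to in each ConvLayer (first element of the modified return)
      results[init_sub] = [lsuv_module(m, xb)[0] for m in mods]

  print(results[0.])
  print(results[50.])   # per-layer values match, regardless of the initial `sub`
  ```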
- Making `sub` a learnable parameter (sketched just below) didn't lead to improvements; it made things worse in my experiments. The value of `sub` seems to be a very sensitive one.
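  By "making `sub` a parameter" I mean a variant of the nb's `GeneralRelu` where `sub` is registered as an `nn.Parameter` and learned, roughly like this (my own experiment, not the lesson code; combining it with lsuv also requires the `ConvLayer.bias` setter to update `sub.data` instead of reassigning the attribute):

  ```python
  import torch
  from torch import nn
  import torch.nn.functional as F

  class GeneralReluParamSub(nn.Module):
      "Variant of the nb's GeneralRelu with a learnable `sub` (my experiment, not the lesson code)."
      def __init__(self, leak=None, sub=0., maxv=None):
          super().__init__()
          self.leak, self.maxv = leak, maxv
          self.sub = nn.Parameter(torch.tensor(float(sub)))   # now updated by the optimizer

      def forward(self, x):
          x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
          x = x - self.sub                   # out-of-place so the gradient reaches self.sub
          if self.maxv is not None: x = x.clamp_max(self.maxv)
          return x
  ```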
- This implementation of lsuv doesn't check whether the variance is tiny (no eps) or undefined (a small bs with no variance) before dividing by it. The nb tests with bs=512, which won't hit any of these issues, but that's far from the general case. Using bs=2 also requires a much, much lower lr. A guarded variant is sketched just below.
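  Here is a rough sketch of what such a guard could look like. It keeps the nb's two loops and relies on its `Hook`, `append_stat`, and the global `mdl`; the `eps`, the NaN bail-out, and the iteration cap are my own additions:

  ```python
  import math

  def lsuv_module_guarded(m, xb, eps=1e-6, max_iters=50):
      h = Hook(m, append_stat)

      # shift the bias until the batch mean of the activations is ~0 (as in the nb)
      for _ in range(max_iters):
          mdl(xb)
          if abs(h.mean) <= 1e-3: break
          m.bias -= h.mean

      # scale the weights until the batch std is ~1, guarding the division
      for _ in range(max_iters):
          mdl(xb)
          if math.isnan(h.std): break        # undefined variance (e.g. degenerate batch) - give up
          if abs(h.std - 1) <= 1e-3: break
          m.weight.data /= (h.std + eps)     # eps keeps a tiny std from blowing up the weights

      h.remove()
      return h.mean, h.std
  ```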
- While experimenting I used a reproducible random seed, so it was helpful to analyse more closely the cases where the network wasn't training (I could then turn different parts on/off). Most of the time lsuv seemed to be the culprit: it is helpful in general, but it also leads to `nan`s at times at the lr used in the nb.
Also note that the original LSUV doesn't tweak the mean, only the std. But without the mean tweak, the lesson nb doesn't perform as well, so this is a bonus. And the nb version doesn't implement the optional orthonormal init. For comparison, a std-only variant is sketched below.
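A std-only version, closer to the original LSUV paper (minus its orthonormal pre-init), would simply drop the nb's mean-centering loop; the names below are the notebook's, the function itself is just a sketch:

```python
def lsuv_module_std_only(m, xb):
    h = Hook(m, append_stat)
    # only scale the weights to unit std; no bias/mean adjustment, unlike the nb's version
    while mdl(xb) is not None and abs(h.std - 1) > 1e-3: m.weight.data /= h.std
    h.remove()
    return h.mean, h.std
```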