I played some more with the 07a_lsuv nb. Here are some observations/notes:
- The `sub` argument shouldn't be configurable, as it gets reset to a value relative to the batch's mean regardless of its initial value. (Unless it's meant to be used some other way without lsuv, but then it'd be very difficult to choose manually, as it varies from layer to layer with lsuv.) To prove that it doesn't need to be configurable, fix the seed and re-run the nb once with `sub` set to 0 and then to 50, adding its value to the return list - after `lsuv_module` is run, `m.relu.sub` ends up being exactly the same value regardless of its initial value.
```python
class ConvLayer(nn.Module):
    def __init__(self, ni, nf, ks=3, stride=2, sub=50., **kwargs):
                                               ^^^^^^^
        [...]

def lsuv_module(m, xb):
    [...]
    return m.relu.sub, h.mean, h.std
           ^^^^^^^^^^
```
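To see why the initial value can't matter, here's a toy numpy stand-in for the mean-correction loop (my sketch, not the nb's torch code): the layer output is `acts - sub`, and nudging `sub` by the measured output mean has a single fixed point - the raw activation mean - so any starting value converges to it.

```python
import numpy as np

# Toy stand-in for lsuv's mean-correction loop (numpy sketch, not the
# nb's torch code). The layer output is `acts - sub`; the loop nudges
# `sub` by the measured output mean until that mean is ~0, so `sub`
# ends at the raw activation mean no matter where it starts.
def fit_sub(acts, sub, tol=1e-6):
    mean = (acts - sub).mean()
    while abs(mean) > tol:
        sub += mean
        mean = (acts - sub).mean()
    return sub

acts = np.random.default_rng(42).normal(loc=3.0, size=1000)
# Same final value whether sub starts at 0 or at 50:
print(fit_sub(acts, 0.0), fit_sub(acts, 50.0))
```

(In the real nb the model is re-run through the nonlinearity on each iteration, so it takes a few steps rather than one, but the fixed point is the same.)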
- Making `sub` a parameter didn't lead to improvements; it made things worse in my experiments. The value of `sub` seems to be a very sensitive one.
- This implementation of lsuv doesn't check whether the variance is tiny (no eps) or undefined (small bs with no variance) before dividing by it - the nb only tests with bs=512, which won't hit either issue, and that's far from the general case. Using bs=2 requires a much, much lower lr.
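A minimal guard (my suggestion, not in the nb) would be to clamp the measured std with an eps before dividing, so a tiny or zero-variance batch can't blow the weights up to inf/nan:

```python
import numpy as np

# Hypothetical eps-guard for the std-division step (not notebook code):
# clamp the measured std so a zero-variance batch can't produce
# inf/nan weights.
def safe_div_by_std(w, acts, eps=1e-6):
    std = acts.std()
    return w / max(std, eps)

w = np.ones(4)
constant_batch = np.zeros(2)   # bs=2, zero variance
scaled = safe_div_by_std(w, constant_batch)
```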
- While experimenting I used a reproducible random seed, so it was helpful to analyse more closely the cases where the network wasn't training (so that I could turn different parts on/off). Most of the time lsuv seemed to be the culprit - so it is helpful in general, but it also leads to nans at times at the lr used in the nb.
Also note that the original LSUV doesn't tweak the mean, only the std. But without the mean tweak the lesson nb doesn't perform as well, so this is a bonus. And the nb version doesn't implement the optional orthonormal init from the paper.
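For reference, that missing orthonormal init could look something like this QR-based numpy sketch (my approximation of the paper's idea, not notebook code) - each filter is flattened to a row, and the rows are made orthonormal:

```python
import numpy as np

# Sketch of an orthonormal init via QR (my numpy approximation of the
# LSUV paper's pre-init, not notebook code). Filters are flattened to
# rows of a (nf, ni*ks*ks) matrix whose rows are made orthonormal.
def ortho_init(shape, seed=0):
    rng = np.random.default_rng(seed)
    flat = (shape[0], int(np.prod(shape[1:])))
    a = rng.normal(size=flat)
    # QR of the tall orientation yields orthonormal columns;
    # transpose back if we started with a wide matrix.
    q, _ = np.linalg.qr(a.T if flat[0] < flat[1] else a)
    q = q.T if flat[0] < flat[1] else q
    return q.reshape(shape)

W = ortho_init((4, 3, 3))   # e.g. 4 filters of 3x3
```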