I had the same idea but it didn’t work for me. When I run:
%time run.fit(2, learn)
after the weight initialization, sometimes it works well, but half of the time the loss either explodes or vanishes.
Did you attempt to train the model after the initialization? How did that work for you?
Bah! You’re right, @Kagan! It doesn’t train!
Indeed, I only tested the stats and assumed the rest would just work! Thank you for checking that my suggestion was wrong!
I tried to renormalize the stats again, in the order std+bias+std, but no, that doesn’t work either.
Rerunning lsuv_module twice destroys training too:
for m in mods: print(lsuv_module(m, xb), lsuv_module(m, xb))
Does anybody have an idea why? It looks like it’s very important that the bias is not zero-centered!
It was zeroed in their original pytorch impl anyway (github). In my experiments zero init was as good as the fast.ai-style one.
Would it be possible to have a look at your full code/telemetry? Curious if we ran into the same thing.
Had a similar symptom in about half of the trainings.
Fixed by reducing the LR from 1 to 0.4; here’s how that looked on telemetry:
It’s just 07a_lsuv.ipynb as is.
Here is another variant I tried, attempting to balance the adjustments while ending up with std=1, mean=0 post lsuv_module:
def lsuv_module(m, xb):
    h = Hook(m, append_stat)
    while mdl(xb) is not None and (abs(h.mean) > 1e-3 or abs(h.std-1) > 1e-3):
        mean,std = h.mean,h.std
        if abs(mean)  > 1e-3: m.bias -= mean
        if abs(std-1) > 1e-3: m.weight.data /= std
    h.remove()
    return h.mean,h.std
(note: it recalculates mean/std twice, but I didn’t bother refactoring, since it’s just a proof of concept.)
It works better (i.e. it trains), but I’m still getting nans every so often. This is with the default lr=0.6 of that nb.
But, of course, the original nb gets nans too every so often, so that learning rate is just too high.
With a lower lr=0.1 the original reversed-order std+bias approach trains just fine too.
Here is a refactored “balanced” version:
def lsuv_module(m, xb):
    h = Hook(m, append_stat)
    while mdl(xb) is not None:
        mean,std = h.mean,h.std
        if abs(mean) > 1e-3 or abs(std-1) > 1e-3:
            m.bias -= mean
            m.weight.data.div_(std)
        else: break
    h.remove()
    return h.mean,h.std
Perhaps self.sub in GeneralReLU needs to be a parameter; then the init will only affect the initial setting and the network can tune it up from there.
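To make the suggestion concrete, here is a minimal sketch of what that could look like. This is a hypothetical class I made up (not the lesson’s actual GeneralReLU, and the name is my own); the only point is that registering sub as an nn.Parameter lets lsuv set its initial value and then lets the optimizer keep tuning it:

```python
import torch
from torch import nn
import torch.nn.functional as F

class GeneralReluParam(nn.Module):
    "Sketch: a ReLU variant whose subtracted offset is learnable."
    def __init__(self, sub=0., leak=None):
        super().__init__()
        # nn.Parameter makes sub show up in model.parameters(),
        # so the optimizer updates it after lsuv sets its starting value
        self.sub = nn.Parameter(torch.tensor(float(sub)))
        self.leak = leak

    def forward(self, x):
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        return x - self.sub
```

(As noted below, in my experiments this didn’t actually help, but this is the shape of the idea.)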
Thank you for checking this out. It’s good to know that we can mitigate the explosion/vanishing by reducing the learning rate.
It’s still kind of weird that this is happening though; I’d expect lsuv std+bias to perform the best since it has the “perfect statistics”.
I thought that it might be related to the distribution of the weights after the initialization (having mean 0, std 1, but being skewed), so I plotted a histogram after the initialization:
for m in mods: plt.figure(); plt.hist(m.weight.view(-1).detach().cpu(), bins=20)
but it looks pretty much the same when the model trains normally and when it goes to NaN.
I played some more with 07a_lsuv nb. Here are some observations/notes:

The sub argument shouldn’t be configurable, as it gets reset to a value relative to the batch’s mean regardless of its initial value. (Unless it’s meant to be used some other way w/o lsuv, but it’d be very difficult to choose manually, as it varies from layer to layer with lsuv.) To prove that it doesn’t need to be configurable, fix the seed and rerun the nb once with sub set to 0 and then to 50, adding its value to the return list: after lsuv_module is run, m.relu.sub ends up being exactly the same value, regardless of its initial value.
class ConvLayer(nn.Module):
    def __init__(self, ni, nf, ks=3, stride=2, sub=50., **kwargs):
                                               ^^^^^^^
    [...]

def lsuv_module(m, xb):
    [...]
    return m.relu.sub, h.mean, h.std
           ^^^^^^^^^^

Making sub a parameter didn’t lead to improvements, but made things worse in my experiments. The value of sub seems to be a very sensitive one.
This implementation of lsuv doesn’t check whether the variance is tiny (no eps) or undefined (small bs w/ no variance) before dividing by it. It tests with bs=512, which won’t have any of these issues, but that’s far from a general case.
Using bs=2 requires a much, much lower lr.
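A minimal sketch of the kind of guard I mean (a hypothetical helper, not code from the notebook): before dividing the weights by the measured std, fall back to a scale of 1.0 when the std is undefined or below an eps:

```python
import math

def safe_scale(std, eps=1e-6):
    """Return the factor to divide the weights by.

    Falls back to 1.0 (i.e. leave the weights untouched) when the
    measured std is NaN (e.g. a batch with no variance) or tiny,
    instead of blowing the weights up by dividing by ~0.
    """
    if math.isnan(std) or abs(std) < eps:
        return 1.0
    return std
```

so the division in lsuv_module would become something like `m.weight.data /= safe_scale(h.std)`.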

While experimenting I used a reproducible random seed, so it was helpful to analyse more closely the cases where the network wasn’t training (so that I could turn different parts on/off). Most of the time lsuv seemed to be the culprit. So it is helpful in general, but also leads to nans at times at the lr used in the nb.
Also note that the original LSUV doesn’t tweak the mean, only the std. But w/o the mean tweak in the lesson nb, things don’t perform as well. So this is a bonus. And the nb version doesn’t implement the optional orthonormal init.
Hi,
According to 10_augmentation.ipynb, using numpy is supposed to result in faster tensor creation, but that’s clearly not happening for me. Not sure what could be going on; I repeated this test with 2 more images:
That’s interesting. I guess it might depend on the Pillow, pytorch, and numpy versions…
Just a quick, really pedantic, domain-specific comment on RandomResizeCrop, which from the lesson it sounds like fastai is looking to replace with a very smart perspective “warp” transform.
Jeremy made comments that objects are never wider or thinner in the real world and that RandomResizeCrop is perhaps trying to account for changes in perspective.
Knowing the behavior of camera lenses, RandomResizeCrop still serves a purpose. For example, some lenses will make faces in portraits (i.e. objects) appear wider, while others will make them appear thinner.
See this site as the first google search reference on this: https://mcpactions.com/2014/05/19/perfectportraitlens/
In short, I think a combination of both could be useful.
From my understanding of lens distortion that’s not quite true: they don’t squish an entire picture just horizontally or vertically. Or at least not enough to make the massive squishing of imagenet transforms sensible.
In the code of this lesson 11, is there any dropout being applied by pytorch underneath during training? Or does it have to be explicitly called for? Thank you.
Here is something I’m confused about:
At around 01:22:00 in the lesson, talking about momentum, Jeremy says that if you have 10 million activations, you need to store 10 million floats. Did he mean weights instead of activations? Because a bit before that he called them parameters (which, as I understand, are the same as weights), and a bit before that he also called them activations (which are very different).
Thanks.
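For what it’s worth, momentum keeps one extra running value per parameter (weight), not per activation; a toy sketch of SGD with momentum (made-up numbers, not the lesson’s optimizer code) showing that the only extra storage is `avg`, shaped like the parameters:

```python
import torch

p = torch.randn(5)             # 5 parameters (weights)
avg = torch.zeros_like(p)      # the 5 extra floats momentum must store
beta, lr = 0.9, 0.1

for _ in range(3):             # a few fake training steps
    grad = torch.ones_like(p)  # pretend gradient
    avg = beta * avg + grad    # running (non-debiased) momentum average
    p = p - lr * avg           # parameter update uses the average
```

So with 10 million parameters, momentum needs 10 million extra floats for `avg`.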
Probably I have missed something, but I do not understand why we choose kernel size 3 instead of 5 for the first layer. If that is something proven by research, can you explain why? Also, why has Jeremy chosen the number of channels for the first layer to be c_in x 3 x 3? I did not understand that part either.
Actually I think it is in this other “bag of tricks” paper (somewhat inflationary use of the term ;)):
Page 5:
The observation is that the computational cost of a convolution is quadratic to the kernel width or height. A 7 × 7 convolution is 5.4 times more expensive than a 3 × 3 convolution. So this tweak replacing the 7 × 7 convolution in the input stem with three conservative 3 × 3 convolutions […]
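The “5.4 times” figure falls straight out of the kernel area; a quick back-of-the-envelope check (my own arithmetic, not from the paper):

```python
# a convolution's cost per output element scales with the kernel area
cost_7x7 = 7 * 7                     # 49 multiply-accumulates
cost_3x3 = 3 * 3                     # 9 multiply-accumulates

ratio_single = cost_7x7 / cost_3x3       # 49/9 ~ 5.4, the paper's figure
ratio_stack = cost_7x7 / (3 * cost_3x3)  # vs three stacked 3x3 convs: ~1.8x
print(ratio_single, ratio_stack)
```

i.e. even replacing one 7×7 with a stack of three 3×3 convs is still cheaper, while keeping a similar receptive field.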
Oops, I copied from Google without checking, so I deleted my reply so others won’t be confused.
Dear @jeremy, thank you sooo much for that moment:
" we show that L2 reg has no regularisation effect…WHAT??? … "
" you know how I keep mentioning how none of us know what they are doing… that should make you feel better about ‘can I contribute to DL’ …? "
It does. I love your way of teaching, encouraging us, knocking down the barriers of entry to deep learning and gifting us all these tips and tools, but THAT WAS BY FAR THE BEST, MOST ENCOURAGING MOMENT so far for me. I was just thinking “my head is about to explode with all this info, I need a walk, fresh air” and boom you dropped the mic.
Thanks again and please keep it coming,
Lamine Gaye
PS: my 1st post on the forums… I just had to say this
At around 28:00, Jeremy talks about get_files() and how fast it is. I was intrigued, and decided to try to recreate it locally and experiment. I wanted to start off with a naive version and see what sort of speed-ups I could achieve:
def get_fnames(train=False):
    path = Path('/Users/daniel/.fastai/data/imagenette160')
    if train:
        path = path/'train'
    else:
        path = path/'val'
    fnames = []
    for _dir in path.ls():
        for fname in _dir.ls():
            fnames.append(fname)
    return fnames
>>> fnames = get_fnames(train=True)
>>> len(fnames)
12894
>>> %timeit -n 10 get_fnames(train=True)
34.9 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> fnames[:3]
[PosixPath('/Users/daniel/.fastai/data/imagenette160/train/n03394916/n03394916_58454.JPEG'),
PosixPath('/Users/daniel/.fastai/data/imagenette160/train/n03394916/n03394916_32588.JPEG'),
PosixPath('/Users/daniel/.fastai/data/imagenette160/train/n03394916/n03394916_32422.JPEG')]
In the video, he shows a ~70ms runtime, whereas I’m getting ~35ms, about twice as fast. This suggests I’ve misunderstood the task at hand or something… can anyone shed any light on what’s going on? Are we in fact doing the same thing? Why does my timing show a faster speed?
Don’t see this mentioned here in the Lesson 11 thread, so adding a link here.
If you get the following errors when running 08_data_block.ipynb:

cos_1cycle_anneal not defined
Runner does not have in_train

Link to the forum thread discussing this and the solution: thanks to @exynos7 for spending the time to solve it for all of us.
A minor tweak that fixes the LSUV algorithm normalization
Note added: I found after writing this post that @stas independently discovered this a while ago
At the end of the 07a_lsuv.ipynb notebook, the means and stds of each layer are shown after the application of the LSUV algorithm, and we see that the means are not near zero.
There is a comment in the notebook: "Note that the mean doesn’t exactly stay at 0. since we change the standard deviation after by scaling the weight."
However, if in the lsuv_module you first scale the standard deviation and then correct the mean, the problem with the not-near-zero means is solved. This involves switching the order of two lines of code as follows:
Original version:
while mdl(xb) is not None and abs(h.mean)  > 1e-3: m.bias -= h.mean
while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
Modified version:
while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
while mdl(xb) is not None and abs(h.mean)  > 1e-3: m.bias -= h.mean
In the next cell, we execute the lsuv initialization on all layers and print the means and standard deviations of the weights:
for m in mods: print(lsuv_module(m, xb))
Here is the output with the modified code:
(2.3492136236313854e-08, 0.9999998807907104)
(2.5848951867857295e-09, 1.0)
(1.7811544239521027e-08, 0.9999998807907104)
(9.778887033462524e-09, 0.9999999403953552)
(1.30385160446167e-08, 1.0000001192092896)
The normalization is now perfect (i.e., within the specified precision) in both mean and standard deviation. However, fixing the mean to be near zero did not improve model accuracy.
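A toy numeric sketch of why the order matters (a single scalar “layer” I made up, not the notebook’s code): shifting the bias doesn’t change the std, but rescaling the weight does shift the mean, so the std fix has to come first.

```python
import torch

torch.manual_seed(0)
x = torch.randn(10_000) + 2.0          # inputs with a nonzero mean
w, b = torch.tensor(3.0), torch.tensor(5.0)
h = w * x + b                          # mean ~ 11, std ~ 3

# mean first, then std (the original order):
b1 = b - h.mean()                      # center: mean ~ 0
w1 = w / (w * x + b1).std()            # rescale: std ~ 1 ...
h_bad = w1 * x + b1                    # ... but the mean has drifted off 0 again

# std first, then mean (the modified order):
w2 = w / h.std()                       # rescale: std ~ 1
b2 = b - (w2 * x + b).mean()           # center: mean ~ 0, std untouched
h_good = w2 * x + b2

print(h_bad.mean().item(), h_good.mean().item(), h_good.std().item())
```

The bias shift at the end is why the modified order lands on mean ≈ 0 and std ≈ 1 simultaneously, matching the printed stats above.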