Does LSUV actually tell us to modify the bias? I’m just wondering whether that comes from the LSUV paper or from Jeremy’s intuition.
I am working through 07a_lsuv and noticed a few things. First, it took me quite a while to wrap my head around what LSUV is actually doing. @hiromi has explained it about 30 different ways and I think I finally understand how it works, but I have a few more questions that I’m hoping somebody can answer. The first is about Jeremy using the relu sub variable instead of the actual bias. I tried changing this to use the conv bias instead of the relu sub, and it doesn’t work. So my question is: how did Jeremy know to shift the output of the activation to get a mean of 0?
Another thing I found interesting after playing around with the conv bias: modifying it actually changes the standard deviation as well, because of the relu afterwards. I didn’t think about that until after I ran it. Basically, shifting the values pushes some of them under the relu cutoff point and others over it, which changes the shape of the distribution, not just its mean.
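A quick way to see this effect outside the notebook, as a minimal sketch: the `pre` values and the `shift` below are made up for illustration and are not from 07a_lsuv. It compares shifting before the relu (like changing the conv bias) with shifting after it (like changing the relu sub).

```python
import math
import random

random.seed(0)
# Made-up pre-activation values, roughly N(0, 1); not taken from the notebook.
pre = [random.gauss(0.0, 1.0) for _ in range(20_000)]

def relu(xs):
    return [max(x, 0.0) for x in xs]

def mean_std(xs):
    m = sum(xs) / len(xs)
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

shift = 0.5  # arbitrary value, for illustration only

# No shift at all.
base_mean, base_std = mean_std(relu(pre))
# Shift BEFORE the relu (analogous to changing the conv bias).
pre_mean, pre_std = mean_std(relu([x - shift for x in pre]))
# Shift AFTER the relu (analogous to changing the relu sub).
post_mean, post_std = mean_std([x - shift for x in relu(pre)])

print(f"no shift:          mean={base_mean:.3f} std={base_std:.3f}")
print(f"shift before relu: mean={pre_mean:.3f} std={pre_std:.3f}")
print(f"shift after relu:  mean={post_mean:.3f} std={post_std:.3f}")
```

Shifting after the relu moves the mean by exactly `shift` and leaves the std alone, while shifting before the relu changes both, since a different fraction of the values gets clipped to zero.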
Could you post a link to the notebook with the code and outputs for the version that does not work?
It does! Didn’t seem to hurt convergence in my experiments (Google Colab).
Sure, here is how I tried defining the ConvLayer:
    class ConvLayer(nn.Module):
        def __init__(self, ni, nf, ks=3, stride=2, sub=0., **kwargs):
            super().__init__()
            self.conv = nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=True)
            self.relu = GeneralRelu(sub=sub, **kwargs)

        def forward(self, x): return self.relu(self.conv(x))

        @property
        def bias(self): return self.conv.bias #return -self.relu.sub
        @bias.setter
        def bias(self,v): self.conv.bias = [v]*self.conv.bias.shape #self.relu.sub = -v
        @property
        def weight(self): return self.conv.weight
I tried a few things, the most recent being:
    def lsuv_module(m, xb):
        h = Hook(m, append_stat)
        #while mdl(xb) is not None and abs(h.mean) > 1e-3: m.bias.data -= h.mean
        while mdl(xb) is not None and ((abs(h.std-1) > 1e-3) or (abs(h.mean) > 1e-3)): m.weight.data /= h.std
I also tried

    while mdl(xb) is not None and abs(h.mean) > 1e-3: m.bias.data -= h.mean #tried += and -=

and this never really finishes, or if it does, the accuracy isn’t good.

    for m in mods: print(lsuv_module(m, xb))
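For comparison, here is a self-contained sketch of the variant that shifts the relu output (the sub) instead of the conv bias. It is loosely modeled on the course notebook but is not a copy of it: the tiny model, the random batch `xb`, the explicit `model` argument, the `max_iter` cap and the plain forward hook (in place of `Hook`/`append_stat`) are all assumptions made so the example runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralRelu(nn.Module):
    def __init__(self, leak=None, sub=0., maxv=None):
        super().__init__()
        self.leak, self.sub, self.maxv = leak, sub, maxv
    def forward(self, x):
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        x = x - self.sub                      # shift applied AFTER the relu
        if self.maxv is not None: x = x.clamp_max(self.maxv)
        return x

class ConvLayer(nn.Module):
    def __init__(self, ni, nf, ks=3, stride=2, sub=0., **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=True)
        self.relu = GeneralRelu(sub=sub, **kwargs)
    def forward(self, x): return self.relu(self.conv(x))
    @property
    def bias(self): return -self.relu.sub     # "bias" here means the relu shift
    @bias.setter
    def bias(self, v): self.relu.sub = -v
    @property
    def weight(self): return self.conv.weight

def lsuv_module(model, m, xb, tol=1e-3, max_iter=50):
    stats = {}
    def hook(mod, inp, out):                  # record stats of this module's output
        stats["mean"], stats["std"] = out.mean().item(), out.std().item()
    handle = m.register_forward_hook(hook)
    with torch.no_grad():
        for _ in range(max_iter):             # step 1: shift the output mean to ~0
            model(xb)
            if abs(stats["mean"]) <= tol: break
            m.bias -= stats["mean"]
        for _ in range(max_iter):             # step 2: rescale weights until std ~1
            model(xb)
            if abs(stats["std"] - 1) <= tol: break
            m.weight.data /= stats["std"]
        model(xb)                             # refresh stats after the last update
    handle.remove()
    return stats["mean"], stats["std"]

torch.manual_seed(0)
# Hypothetical tiny model and random batch, only for demonstration.
model = nn.Sequential(ConvLayer(1, 8), ConvLayer(8, 16),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
xb = torch.randn(64, 1, 28, 28)
mods = [m for m in model if isinstance(m, ConvLayer)]
results = [lsuv_module(model, m, xb) for m in mods]
for r in results: print(r)
```

Because the shift happens after the relu, step 1 changes only the mean. Step 2 then rescales the weights, which moves the mean away from zero again, so with this ordering the final std ends up close to 1 but the final mean is not exactly 0.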
Which module does it get stuck on, and what are the dynamics of the process? (The mean/std series for the tensor passed into the module, as well as for the one coming out.)
What does the telemetry (per-module activation means, stds, hists, percentage of activations close to zero) look like during training in this case?
Not in the paper. But more recent papers like ELU discuss why zero mean matters, and we also saw it when we defined GeneralRelu
Since you have understood LSUV well, maybe you can help; I’m still trying to get a hold of it. Some of these questions may sound silly.
- Is the purpose of LSUV to keep the activations’ mean at 0 and standard deviation at 1, or to make the initial weights and bias have mean 0 and standard deviation 1?
- Why are we changing the weights and bias here? Does doing this ensure that the activations’ mean is close to 0 and their standard deviation close to 1?
- When we call lsuv and then run.fit, does fit use the modified weights and bias? I didn’t understand how the changes that lsuv makes are reflected in the actual weights used during fit.
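On the last question, a small sketch may help. It assumes that fit trains the same model object that lsuv was run on. The `nn.Linear` layer and the SGD optimizer below are stand-ins chosen for brevity, not the course’s `Runner`. An in-place update such as `m.weight.data /= h.std` edits the very tensor the optimizer holds, so training starts from the rescaled values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 3)                       # stand-in for one layer of the model
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

# The optimizer stores references to the parameters, not copies.
param_in_opt = opt.param_groups[0]["params"][0]
before = layer.weight.detach().clone()

with torch.no_grad():
    layer.weight.data /= 2.0                  # LSUV-style in-place rescale

print(param_in_opt is layer.weight)
print(torch.allclose(param_in_opt.detach(), before / 2.0))
```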