I meant, what’s the issue you had with Lovely Tensors on the GPU data?
How did I not know lovely tensors before?! Really lovely.
They make tensors disappear from the output if your tensor is on the GPU (in my case it was MPS; it might be better with CUDA).
My bad, I fixed MPS but haven’t released a new version yet.
Or do you see a different issue? I would expect an exception, not empty output.
Has anyone come across ZerO initialization? It’s a technique which only uses zeros or ones for initialization.
I was curious to see how it would perform on the example from the lesson, so I updated init_weights as shown below, and it led to a slight improvement in accuracy, from 87.6 to 87.9.
def init_weights(m, **kwargs):
    if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Conv3d)): torch.nn.init.eye_(torch.empty(m.in_channels, m.out_channels))
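For comparison, here is a minimal sketch that applies an identity-style init directly to the layer weights, using PyTorch’s nn.init.dirac_ for convolutions and nn.init.eye_ for linear layers; this is my own variant for illustration, not the exact ZerO recipe:

import torch
from torch import nn

def init_weights_identity(m, **kwargs):
    # dirac_ fills a conv kernel so the layer passes through as many input channels as it can (identity-like);
    # eye_ fills a 2D weight with the identity matrix
    if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Conv3d)): nn.init.dirac_(m.weight)
    elif isinstance(m, nn.Linear): nn.init.eye_(m.weight)

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 8, 3, padding=1))
model.apply(init_weights_identity)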
Released version 0.1.11; MPS should work now, but please let me know if it does not, as I don’t have a Mac to test on.
Works well, thank you! I love the idea! Thank you for making this library :). If you need help testing on M1, DM me.
Here is LSUV init as a callback; unfortunately it underperforms compared to Kaiming init. Here is a notebook with the implementation.
#export
def _lsuv_stats(hook, mod, inp, outp):
    acts = to_cpu(outp)
    hook.mean = acts.mean()
    hook.std = acts.std()

def lsuv_init(model, m, m_in, xb, eps=1e-3, log=print):
    h = Hook(m, _lsuv_stats)
    max_step = 100
    with torch.no_grad():
        # each forward pass refreshes h.mean/h.std via the hook on m
        while model(xb) is not None and (abs(h.std-1)>eps or abs(h.mean)>eps):
            log(f'LSUV: {h.mean} {h.std} {max_step}')
            m_in.bias -= h.mean
            m_in.weight.data /= h.std
            max_step -= 1
            if max_step == 0: break
    log(f'LSUV: {m_in} {h.mean} {h.std} {max_step}')
    h.remove()

def lsuv_layers(model):
    "Default: measure and tweak at the same conv/linear layers."
    conv_lin = [o for o in model.modules() if isinstance(o, (nn.Conv2d, nn.Linear))]
    return zip(conv_lin, conv_lin)

class LSUVInit(Callback):
    def __init__(self, layers=None, eps=1e-3, verbose=False):
        """layers is a function returning an iterable of (point of measurement, conv|linear to tweak) pairs"""
        self.layers = layers if layers is not None else lsuv_layers
        self.log = print if verbose else fc.noop
        self.eps = eps

    def before_batch(self, learn):
        if getattr(learn.model, 'lsuv_init', False): return
        layers = list(self.layers(learn.model))
        self.log('LSUV init', layers)
        xb,_ = learn.batch
        training = learn.model.training
        learn.model.train(False)
        with torch.no_grad():
            for ms in layers:
                self.log(ms)
                lsuv_init(learn.model, *ms, xb, eps=self.eps, log=self.log)
        learn.model.lsuv_init = True
        learn.model.train(training)
        print(f'LSUV init done on {len(layers)} layers')
Jeremy presented a simplified and improved version of LSUV; to get exactly the same values as in the lesson, you can run:
def our_model_layers(model):
    relus = [o for o in model.modules() if isinstance(o, (GeneralRelu, nn.ReLU))]
    convs = [o for o in model.modules() if isinstance(o, nn.Conv2d)]
    # if len(relus) < len(convs):
    #     relus = relus + convs[len(relus):]
    return zip(relus, convs)

set_seed(42)
learn = MomentumLearner(get_model(act_gr), dls, F.cross_entropy, lr=0.2,
                        cbs=cbs+[LSUVInit(our_model_layers, eps=0.001)])
learn.fit(3)
The differences from the LSUV authors’ (D. Mishkin & J. Matas) code implementation are the following:
- We don’t pre-initialise the weights with the orthonormal initialisation described by Saxe, A. et al. (2013); code to do so is in the notebook. (BTW, the PyTorch implementation of the Saxe 2013 init, nn.init.normal_, underperforms compared to the implementation provided by Mishkin.)
- We measure stats after the activation, ignoring the last layer, while according to the LSUV PyTorch code they take stats directly at the convolution, on all convolutions. It underperforms a bit, but it works for all networks and works better when the orthonormal initialisation is used. I’ve set it as the default for LSUVInit().
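In case it helps, here is a rough sketch of what an orthogonal pre-init pass could look like using the stock PyTorch routine (this is not the Mishkin implementation from the notebook, just an illustration):

from torch import nn

def orthogonal_pre_init(model):
    # apply orthogonal init to every conv/linear weight before running LSUV
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.orthogonal_(m.weight)
            if m.bias is not None: nn.init.zeros_(m.bias)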
I’m a whole month late to finally watch this, but I had to drop by and mention that this was a fantastic lecture.
I really appreciate all the effort put into walking slowly through all the various initialisation and normalisation techniques, plotting the stats, and talking about the intuition and need behind them, even if some of them aren’t used eventually.
Also, I keep forgetting or getting confused about what BatchNorm actually does, so I’m glad to hear Jeremy make it very clear that it doesn’t necessarily do what we generally think of as normalisation (given the learnable parameters).
Lots of nuggets here, will have to come back again later and make notes.
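Since BatchNorm’s learnable parameters came up above, here is a rough sketch of the idea (my own simplified version: training mode only, no running statistics, not the lesson’s exact code). After the normalisation step, the learnable scale and shift are free to move the activations away from zero mean and unit variance again.

import torch
from torch import nn

class SimpleBatchNorm2d(nn.Module):
    def __init__(self, nf, eps=1e-5):
        super().__init__()
        # learnable per-channel scale and shift; after training these need not stay at 1 and 0
        self.mult = nn.Parameter(torch.ones(1, nf, 1, 1))
        self.add  = nn.Parameter(torch.zeros(1, nf, 1, 1))
        self.eps = eps

    def forward(self, x):
        mean = x.mean((0, 2, 3), keepdim=True)    # per-channel batch statistics
        var  = x.var((0, 2, 3), keepdim=True)
        x = (x - mean) / (var + self.eps).sqrt()  # the "normalisation" part
        return x * self.mult + self.add           # ...which mult/add can undo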
Regarding input normalization (around 42 min):
One question/opinion about input normalization: is there a downside to normalizing the input at the batch level? If the dataset is not well balanced, say with lots of dark and light pictures, there is a chance that many dark pictures end up in the same batch and get normalized together, while the light ones are again normalized in isolation. Is this a problem?
For the batch sizes we use, this is not a problem in practice.
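For concreteness, here is a small sketch of the two options being discussed, with made-up tensors and statistics (the names and numbers are mine):

import torch

xb = torch.rand(512, 1, 28, 28)      # one batch, e.g. containing mostly dark images

# per-batch normalisation: the statistics depend on whatever landed in this batch
xb_batch = (xb - xb.mean()) / xb.std()

# dataset-level normalisation: statistics computed once over the whole training set
train_mean, train_std = 0.28, 0.35   # placeholder values standing in for real dataset stats
xb_data = (xb - train_mean) / train_std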
I don’t see you doing an orthonormal initialization in any of the code you show. Btw, it’s torch.nn.init.orthogonal_, not nn.init.normal_. In my case it did increase the accuracy: 0.871 vs. 0.861.
Edit: Oops, now I see your implementation. Odd.
I can’t for the life of me figure out why the callback implementation underperforms compared to the one shown in the course, even though it’s clear from the graphs.
First of all, a thousand thanks for the amazing course!
I might be utterly wrong, but in notebook 11_initializing.ipynb, shouldn’t we check for classes instead of instances in the conv function?
I.e. instead of
def conv(ni, nf, ks=3, stride=2, act=nn.ReLU, norm=None, bias=None):
    if bias is None: bias = not isinstance(norm, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))
    ...
should it maybe be:
if bias is None: bias = norm not in (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
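If norm is indeed passed in as a class (e.g. nn.BatchNorm2d) rather than an instance, as the question assumes, this toy check shows why the two differ (the snippet is just for illustration):

from torch import nn

norm = nn.BatchNorm2d                       # the class itself, not an instance
print(isinstance(norm, nn.BatchNorm2d))     # False: norm is a class, so isinstance never matches here
print(norm in (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))            # True
print(issubclass(norm, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)))  # True, and also covers subclasses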
This tiny difference in accuracy (0.003) isn’t meaningful. It might be caused by CUDA: cuDNN benchmarks which convolution algorithm to use, and different algorithms produce slightly different weights due to floating point errors. I’ve seen bigger differences after switching GPUs, changing the CUDA version, or simply after restarting my notebook. These differences vanish if you turn off the non-deterministic performance optimisations (Reproducibility — PyTorch 2.0 documentation), but then your training will run slower.
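For reference, the switches I mean are along these lines (standard PyTorch settings; expect slower training with them on):

import torch

torch.backends.cudnn.benchmark = False      # don't auto-tune the convolution algorithm per run
torch.backends.cudnn.deterministic = True   # pick deterministic cuDNN kernels
torch.use_deterministic_algorithms(True)    # raise an error on remaining non-deterministic ops
# some CUDA versions also need CUBLAS_WORKSPACE_CONFIG=":4096:8" set in the environment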
That is true, but I think the differences in the other plots are significant, i.e. more dead neurons, activations that don’t start at 0, and a much more chaotic color_dim.
Do these differences vanish as well?
I wrote a blog on the part of this lesson covering Glorot init, Kaiming init, and general relu. Trying different parameters for general relu was fun.
Hopefully, this blog helps.
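For anyone reading along who hasn’t opened the notebook yet, the general relu I experimented with is along these lines (reconstructed from the lesson, so double-check against the notebook; the parameter values are just one setting to try):

from functools import partial
import torch.nn.functional as F
from torch import nn

class GeneralRelu(nn.Module):
    def __init__(self, leak=None, sub=None, maxv=None):
        super().__init__()
        self.leak, self.sub, self.maxv = leak, sub, maxv

    def forward(self, x):
        # leaky slope for negative inputs, optional downward shift, optional ceiling
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        if self.sub is not None: x -= self.sub
        if self.maxv is not None: x.clamp_max_(self.maxv)
        return x

act_gr = partial(GeneralRelu, leak=0.1, sub=0.4)  # one parameter setting to try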
I wrote part 2 of my blog covering LSUV, layer norm, and batch norm. For batch norm, I did a deeper dive into the paper, going over the pseudocode and some of the math.
I also tried a version of layer norm that calculates the means and variances per output feature, the way batch norm does, and it performed better than the original layer norm and batch norm. Did anyone else try this?
I am, if you will, a new student of your course, which I appreciate greatly. So first I really want to thank the team as a whole; it is top work.
Now I have a question: in the SGD class we define the following functions:
def step(self):
    with torch.no_grad():
        for p in self.params:
            self.reg_step(p)
            self.opt_step(p)
    self.i += 1

def opt_step(self, p): p -= p.grad * self.lr

def reg_step(self, p):
    if self.wd != 0: p *= 1 - self.lr*self.wd
I might be wrong, but I think we should remove the learning rate from reg_step, as it will be multiplied in the optimization step that comes next.
Hello,
I hadn’t thought about that when I went through the course, but I think you are right.
Have you tried training models without the learning rate in reg_step?
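To make the two variants concrete, here is the algebra of a single step on a toy parameter (just arithmetic, not a claim about which variant is right):

import torch

p, grad, lr, wd = torch.tensor(1.0), torch.tensor(0.5), 0.1, 0.01

# current code: reg_step scales by (1 - lr*wd), then opt_step subtracts lr*grad
with_lr = p*(1 - lr*wd) - lr*grad    # p - lr*(wd*p + grad), i.e. classic L2-style weight decay -> 0.9490

# proposed change: drop lr from reg_step, so the decay is no longer scaled by the learning rate
without_lr = p*(1 - wd) - lr*grad    # -> 0.9400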