Lesson 17 official topic

I meant, what’s the issue you had with Lovely Tensors on the GPU data?

How did I not know lovely tensors before?! :exploding_head: Really lovely.

1 Like

They make tensors disappear from the output if your tensor is on the GPU (in my case it was MPS; it might be better with CUDA).

My bad, I fixed MPS but haven't released a new version yet.

Or do you see a different issue? I would expect an exception, not empty output.

2 Likes

Has anyone come across ZerO initialization? It’s a technique which only uses zeros or ones for initialization.

I was curious to see how it would perform on the example from the lesson so I updated init_weights as shown below and it led to a slight improvement in accuracy from 87.6 => 87.9.

def init_weights(m, **kwargs):
    # ZerO-style identity init on the conv weight itself: dirac_ writes an identity mapping over channels
    if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Conv3d)): torch.nn.init.dirac_(m.weight)
4 Likes

Released version 0.1.11; MPS should work now, but please let me know if it does not, as I don't have a Mac to test on.
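If you want a quick check, something like this should now print a proper summary for an MPS tensor rather than an empty line (assuming the usual monkey_patch setup):

import torch
import lovely_tensors as lt

lt.monkey_patch()  # patch Tensor repr to the "lovely" summary

dev = 'mps' if torch.backends.mps.is_available() else 'cpu'  # falls back to CPU elsewhere
t = torch.randn(16, 3, 28, 28, device=dev)
print(t)  # expect a shape/stats summary, not empty output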

1 Like

Works well, thank you! I love the idea! Thank you for making this library :). If you need help testing on M1, DM me.

1 Like

Here is LSUV init as a callback; unfortunately it underperforms compared to Kaiming init. Here is a notebook with the implementation.

#export
def _lsuv_stats(hook, mod, inp, outp):
    acts = to_cpu(outp)
    hook.mean = acts.mean()
    hook.std = acts.std()

def lsuv_init(model, m, m_in, xb, eps=1e-3, log=print):
    h = Hook(m, _lsuv_stats)
    max_step = 100
    with torch.no_grad():
        while model(xb) is not None and (abs(h.std-1)>eps or abs(h.mean)>eps):
            log(f'LSUV: {h.mean} {h.std} {max_step}')
            m_in.bias -= h.mean
            m_in.weight.data /= h.std
            max_step -= 1
            if max_step == 0: 
                break
        log(f'LSUV: {m_in} {h.mean} {h.std} {max_step}')
    h.remove()
    
def lsuv_layers(model):
    conv_lin = [o for o in model.modules() if isinstance(o, (nn.Conv2d, nn.Linear))]
    return zip(conv_lin, conv_lin)

class LSUVInit(Callback):
    def __init__(self, layers=None, eps=1e-3, verbose=False):
        """Layers is a function that returns iterable of point of measurement and conv|linear to tweak"""
        self.layers = layers if layers is not None else lsuv_layers
        self.log = fc.noop if not verbose else print
        self.eps = eps
    
    def before_batch(self, learn):
        if getattr(learn.model, 'lsuv_init', False): return
        layers = list(self.layers(learn.model))
        self.log('LSUV init', layers)
        xb,_ = learn.batch
        training  = learn.model.training
        learn.model.train(False)
        with torch.no_grad():    
            for ms in layers: 
                self.log(ms)
                lsuv_init(learn.model, *ms, xb, eps=self.eps, log=self.log)
        learn.model.lsuv_init = True
        learn.model.train(training)
        print(f'LSUV init done on {len(layers)} layers')

Jeremy presented a simplified and improved version of LSUV; to get exactly the same values as in the lesson, you can run:

def our_model_layers(model):
    relus = [o for o in model.modules() if isinstance(o, (GeneralRelu, nn.ReLU))]
    convs = [o for o in model.modules() if isinstance(o, nn.Conv2d)]
    # if len(relus) < len(convs):
    #    relus = relus + convs[len(relus):]
    return zip(relus,convs)

set_seed(42)
learn = MomentumLearner(get_model(act_gr), dls, F.cross_entropy, lr=0.2, cbs=cbs+[LSUVInit(our_model_layers, eps=0.001)])
learn.fit(3)

The differences from the LSUV authors' (D. Mishkin & J. Matas) code implementation are the following:

  1. We don’t preinitialise weights with the orthonormal initialisation described by Saxe et al. (2013); code to do so is in the notebook (see also the sketch after this list).

BTW, the PyTorch implementation of Saxe 2013 init (nn.init.normal_) underperforms compared to the implementation provided by Mishkin.

  2. We measure stats after the activation, ignoring the last layer, whereas according to the LSUV PyTorch code they take stats directly at the convolution, on all convolutions. It underperforms a bit, but it works for all networks and works better when orthonormal initialisation is used. I’ve set it as the default for LSUVInit().
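For reference, here is a minimal sketch of that orthonormal pre-initialisation using PyTorch's built-in nn.init.orthogonal_ (the notebook itself follows Mishkin's code, which differs in details):

import torch.nn as nn

def orthonormal_init(model):
    # Saxe et al. (2013) orthogonal init on every conv/linear weight;
    # orthogonal_ flattens conv weights to 2D internally
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.orthogonal_(m.weight)
            if m.bias is not None: nn.init.zeros_(m.bias)
    return model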
4 Likes

I’m a whole month late to finally watch this, but I had to drop by and mention that this was a fantastic lecture.
I really appreciate all the effort put into walking slowly through all the various initialisation+normalisation techniques, plotting the stats, and talking about the intuition/need behind them, even if some of them aren’t used eventually.
Also, I keep forgetting/getting confused about what BatchNorm actually does, so I’m glad to hear Jeremy make it very clear that it doesn’t necessarily do what we generally think of as normalisation (given the learnable parameters).
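A quick way to see that point, with made-up scale/shift values:

import torch, torch.nn as nn

bn = nn.BatchNorm1d(1)                   # train mode by default
with torch.no_grad():
    bn.weight.fill_(3.)                  # learnable scale (gamma)
    bn.bias.fill_(5.)                    # learnable shift (beta)
    y = bn(torch.randn(256, 1))
print(y.mean().item(), y.std().item())   # roughly 5 and 3, not 0 and 1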
Lots of nuggets here, will have to come back again later and make notes.

5 Likes

Regarding input normalization (around 42 min):
One question/opinion about input normalization: is there a downside to normalizing the input at the batch level? I mean, if the dataset is not well balanced, say with lots of dark and light pictures, there is a chance that many dark pictures end up in the same batch and get normalized together, while the light ones are likewise normalized in isolation. Is this a problem?
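To make the question concrete, here are the two options side by side (hypothetical helper names, just for illustration):

def norm_per_batch(xb):
    # statistics recomputed from whatever happens to be in this batch
    return (xb - xb.mean()) / xb.std()

def norm_with_dataset_stats(xb, mean, std):
    # fixed statistics estimated once over the whole training set
    return (xb - mean) / std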

1 Like

For the batch sizes we use, this is not a problem in practice.

2 Likes

I don’t see you doing an orthonormal initialization in any of the code you show. Btw, it’s torch.nn.init.orthogonal_, not nn.init.normal_. In my case it did increase the accuracy, 0.871 vs. 0.861.
Edit: Oops, now I see your implementation. Odd.

1 Like

I can’t for the life of me figure out why the callback implementation underperforms compared to the one shown in the course, even though it’s clear from the graphs.

First of all, thousand thanks for the amazing course!

I might be utterly wrong, but in notebook 11_initializing.ipynb shouldn’t we check for classes instead of instances in the conv function?

I.e. instead of

def conv(ni, nf, ks=3, stride=2, act=nn.ReLU, norm=None, bias=None):
    if bias is None: bias = not isinstance(norm, (nn.BatchNorm1d,nn.BatchNorm2d,nn.BatchNorm3d))
    ...

should it maybe be:

    if bias is None: bias = not norm in (nn.BatchNorm1d,nn.BatchNorm2d,nn.BatchNorm3d)
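A quick check of the difference, assuming norm is passed as a class (e.g. conv(..., norm=nn.BatchNorm2d)):

import torch.nn as nn

norm = nn.BatchNorm2d
print(isinstance(norm, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)))  # False: norm is a class here, not an instance
print(norm in (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))            # True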

This tiny difference in accuracy (0.003) isn’t meaningful. It might be caused by CUDA benchmarking which convolution algorithm to use; different algorithms produce slightly different weights due to floating-point errors. I’ve seen bigger differences after switching GPUs or CUDA versions, or simply after restarting my notebook. These differences vanish if you turn off the non-deterministic performance optimisations (Reproducibility — PyTorch 2.0 documentation), but then your training will run slower.
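If you want to try it, a minimal set of switches (as described in the PyTorch reproducibility docs) looks roughly like this:

import torch

torch.backends.cudnn.benchmark = False     # stop cuDNN from auto-tuning conv algorithms per input size
torch.backends.cudnn.deterministic = True  # prefer deterministic cuDNN kernels
torch.use_deterministic_algorithms(True)   # raise an error on remaining non-deterministic ops
# some CUDA ops additionally need CUBLAS_WORKSPACE_CONFIG=:4096:8 set in the environment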

That is true, but I think the differences in the other plots are significant, i.e. more dead neurons, activations that don’t start at 0, and a much more chaotic color_dim.
Do these differences vanish as well?

I wrote a blog post on the part of this lesson covering Glorot init, Kaiming init, and GeneralRelu. Trying different parameters for GeneralRelu was fun.

Hopefully, this blog helps.

2 Likes

I wrote a part 2 of my blog covering LSUV, layer norm and batch norm. For batch norm, I did a deeper dive into the paper, going over pseudocode and some math.

I also tried a layer norm variant that calculates means and variances per feature, as batch norm does, and it performed better than the original layer norm and batch norm. Did anyone try this?

1 Like

I am a new student, if you will, of your course, which I appreciate greatly. So first I really want to thank the team as a whole; it is top work.

Now I have a question. In the SGD class we define the following functions:

def step(self):
    with torch.no_grad():
        for p in self.params:
            self.reg_step(p)
            self.opt_step(p)
    self.i +=1

def opt_step(self, p): p -= p.grad * self.lr
def reg_step(self, p):
    if self.wd != 0: p *= 1 - self.lr*self.wd

I might be wrong, but I think we should remove the learning rate from reg_step, as it will be multiplied in the optimization step that comes next.
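For reference, here is what a single step of the class above computes on a toy tensor (made-up numbers):

import torch

lr, wd = 0.1, 0.01
p, grad = torch.tensor([2.0]), torch.tensor([0.5])

p = p * (1 - lr*wd)   # reg_step: p <- (1 - lr*wd) * p
p = p - grad * lr     # opt_step: p <- p - lr * grad
print(p)              # tensor([1.9480]) == (1 - 0.001)*2.0 - 0.1*0.5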

Hello,

I hadn’t thought about that when I went through the course, but I think you are right.
Have you tried training models without a learning rate in reg_step?