I meant, what’s the issue you had with Lovely Tensors on the GPU data?
How did I not know lovely tensors before?! Really lovely.
They make tensors disappear from the output if your tensor is on the GPU (in my case it was MPS; it might be better with CUDA).
My bad, I fixed MPS but haven’t released a new version yet.
Or do you see a different issue? I would expect an exception, not empty output.
Has anyone come across ZerO initialization? It’s a technique which only uses zeros or ones for initialization.
I was curious to see how it would perform on the example from the lesson, so I updated init_weights as shown below, and it led to a slight improvement in accuracy, from 87.6 to 87.9.
def init_weights(m, **kwargs):
    if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Conv3d)): torch.nn.init.eye_(torch.empty(m.in_channels, m.out_channels))
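For comparison, here is a minimal sketch that applies an identity-style init directly to the layer weights, using PyTorch’s nn.init.dirac_ for convolutions and nn.init.eye_ for linear layers; this is my own variant for illustration, not the exact ZerO recipe:

import torch
from torch import nn

def init_weights_identity(m, **kwargs):
    # dirac_ fills a conv kernel so the layer passes through as many input channels as it can (identity-like);
    # eye_ fills a 2D weight with the identity matrix
    if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Conv3d)): nn.init.dirac_(m.weight)
    elif isinstance(m, nn.Linear): nn.init.eye_(m.weight)

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 8, 3, padding=1))
model.apply(init_weights_identity)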
Released version 0.1.11; MPS should work now, but please let me know if it does not, as I don’t have a Mac to test on.
Works well, thank you! I love the idea! Thank you for making this library :). If you need help testing on M1, DM me.
Here is LSUV init as a callback; unfortunately it underperforms compared to Kaiming init. Here is a notebook with the implementation.
#export
def _lsuv_stats(hook, mod, inp, outp):
    acts = to_cpu(outp)
    hook.mean = acts.mean()
    hook.std = acts.std()

def lsuv_init(model, m, m_in, xb, eps=1e-3, log=print):
    h = Hook(m, _lsuv_stats)
    max_step = 100
    with torch.no_grad():
        # each forward pass refreshes h.mean/h.std via the hook on m
        while model(xb) is not None and (abs(h.std-1)>eps or abs(h.mean)>eps):
            log(f'LSUV: {h.mean} {h.std} {max_step}')
            m_in.bias -= h.mean
            m_in.weight.data /= h.std
            max_step -= 1
            if max_step == 0: break
    log(f'LSUV: {m_in} {h.mean} {h.std} {max_step}')
    h.remove()

def lsuv_layers(model):
    "Default: measure and tweak at the same conv/linear layers."
    conv_lin = [o for o in model.modules() if isinstance(o, (nn.Conv2d, nn.Linear))]
    return zip(conv_lin, conv_lin)

class LSUVInit(Callback):
    def __init__(self, layers=None, eps=1e-3, verbose=False):
        """layers is a function returning an iterable of (point of measurement, conv|linear to tweak) pairs"""
        self.layers = layers if layers is not None else lsuv_layers
        self.log = print if verbose else fc.noop
        self.eps = eps

    def before_batch(self, learn):
        if getattr(learn.model, 'lsuv_init', False): return
        layers = list(self.layers(learn.model))
        self.log('LSUV init', layers)
        xb,_ = learn.batch
        training = learn.model.training
        learn.model.train(False)
        with torch.no_grad():
            for ms in layers:
                self.log(ms)
                lsuv_init(learn.model, *ms, xb, eps=self.eps, log=self.log)
        learn.model.lsuv_init = True
        learn.model.train(training)
        print(f'LSUV init done on {len(layers)} layers')
Jeremy presented a simplified and improved version of LSUV; to get exactly the same values as in the lesson, you can run:
def our_model_layers(model):
    relus = [o for o in model.modules() if isinstance(o, (GeneralRelu, nn.ReLU))]
    convs = [o for o in model.modules() if isinstance(o, nn.Conv2d)]
    # if len(relus) < len(convs):
    #     relus = relus + convs[len(relus):]
    return zip(relus, convs)

set_seed(42)
learn = MomentumLearner(get_model(act_gr), dls, F.cross_entropy, lr=0.2,
                        cbs=cbs+[LSUVInit(our_model_layers, eps=0.001)])
learn.fit(3)
The differences from the LSUV authors’ (D. Mishkin & J. Matas) code implementation are the following:
- We don’t pre-initialise the weights with the orthonormal initialisation described by Saxe, A. et al. (2013); code to do so is in the notebook. (BTW, the PyTorch implementation of the Saxe 2013 init, nn.init.normal_, underperforms compared to the implementation provided by Mishkin.)
- We measure stats after the activation, ignoring the last layer, while according to the LSUV PyTorch code they take stats directly at the convolution, on all convolutions. It underperforms a bit, but it works for all networks and works better when the orthonormal initialisation is used. I’ve set it as the default for LSUVInit().
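In case it helps, here is a rough sketch of what an orthogonal pre-init pass could look like using the stock PyTorch routine (this is not the Mishkin implementation from the notebook, just an illustration):

from torch import nn

def orthogonal_pre_init(model):
    # apply orthogonal init to every conv/linear weight before running LSUV
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.orthogonal_(m.weight)
            if m.bias is not None: nn.init.zeros_(m.bias)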
I’m a whole month late to finally watch this, but I had to drop by and mention that this was a fantastic lecture.
I really appreciate all the effort put into walking slowly through all the various initialisation and normalisation techniques, plotting the stats, and talking about the intuition and need behind them, even if some of them aren’t used eventually.
Also, I keep forgetting or getting confused about what BatchNorm actually does, so I’m glad to hear Jeremy make it very clear that it doesn’t necessarily do what we generally think of as normalisation (given the learnable parameters).
Lots of nuggets here, will have to come back again later and make notes.
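Since BatchNorm’s learnable parameters came up above, here is a rough sketch of the idea (my own simplified version: training mode only, no running statistics, not the lesson’s exact code). After the normalisation step, the learnable scale and shift are free to move the activations away from zero mean and unit variance again.

import torch
from torch import nn

class SimpleBatchNorm2d(nn.Module):
    def __init__(self, nf, eps=1e-5):
        super().__init__()
        # learnable per-channel scale and shift; after training these need not stay at 1 and 0
        self.mult = nn.Parameter(torch.ones(1, nf, 1, 1))
        self.add  = nn.Parameter(torch.zeros(1, nf, 1, 1))
        self.eps = eps

    def forward(self, x):
        mean = x.mean((0, 2, 3), keepdim=True)    # per-channel batch statistics
        var  = x.var((0, 2, 3), keepdim=True)
        x = (x - mean) / (var + self.eps).sqrt()  # the "normalisation" part
        return x * self.mult + self.add           # ...which mult/add can undo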
Regarding input normalization (around 42 min):
One question/opinion about input normalization: is there a downside to normalizing the input at the batch level? If the dataset is not well balanced, say with lots of dark and light pictures, there is a chance that many dark pictures end up in the same batch and get normalized together, while the light ones are again normalized in isolation. Is this a problem?
For the batch sizes we use, this is not a problem in practice.
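For concreteness, here is a small sketch of the two options being discussed, with made-up tensors and statistics (the names and numbers are mine):

import torch

xb = torch.rand(512, 1, 28, 28)      # one batch, e.g. containing mostly dark images

# per-batch normalisation: the statistics depend on whatever landed in this batch
xb_batch = (xb - xb.mean()) / xb.std()

# dataset-level normalisation: statistics computed once over the whole training set
train_mean, train_std = 0.28, 0.35   # placeholder values standing in for real dataset stats
xb_data = (xb - train_mean) / train_std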
I don’t see you doing an orthonormal initialization in any of the code you show. Btw, it’s torch.nn.init.orthogonal_, not nn.init.normal_. In my case it did increase the accuracy: 0.871 vs. 0.861.
Edit: Oops, now I see your implementation. Odd.
I can’t for the life of me figure out why the callback implementation underperforms compared to the one shown in the course, even though it’s clear from the graphs.
First of all, a thousand thanks for the amazing course!
I might be utterly wrong, but in notebook 11_initializing.ipynb, shouldn’t we check for classes instead of instances in the conv function?
I.e. instead of
def conv(ni, nf, ks=3, stride=2, act=nn.ReLU, norm=None, bias=None):
    if bias is None: bias = not isinstance(norm, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))
    ...
should it maybe be:
if bias is None: bias = norm not in (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
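If norm is indeed passed in as a class (e.g. nn.BatchNorm2d) rather than an instance, as the question assumes, this toy check shows why the two differ (the snippet is just for illustration):

from torch import nn

norm = nn.BatchNorm2d                       # the class itself, not an instance
print(isinstance(norm, nn.BatchNorm2d))     # False: norm is a class, so isinstance never matches here
print(norm in (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))            # True
print(issubclass(norm, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)))  # True, and also covers subclasses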
This tiny difference in accuracy (0.003) isn’t meaningful. It might be caused by CUDA: cuDNN benchmarks which convolution algorithm to use, and different algorithms produce slightly different weights due to floating point errors. I’ve seen bigger differences after switching GPUs, changing the CUDA version, or simply after restarting my notebook. These differences vanish if you turn off the non-deterministic performance optimisations (Reproducibility — PyTorch 2.0 documentation), but then your training will run slower.
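For reference, the switches I mean are along these lines (standard PyTorch settings; expect slower training with them on):

import torch

torch.backends.cudnn.benchmark = False      # don't auto-tune the convolution algorithm per run
torch.backends.cudnn.deterministic = True   # pick deterministic cuDNN kernels
torch.use_deterministic_algorithms(True)    # raise an error on remaining non-deterministic ops
# some CUDA versions also need CUBLAS_WORKSPACE_CONFIG=":4096:8" set in the environment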
That is true, but I think the differences in the other plots are significant, i.e. more dead neurons, activations that don’t start at 0, and a much more chaotic color_dim.
Do these differences vanish as well?
I wrote a blog on the part of this lesson covering Glorot init, Kaiming init, and general relu. Trying different parameters for general relu was fun.
Hopefully, this blog helps.
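For anyone reading along who hasn’t opened the notebook yet, the general relu I experimented with is along these lines (reconstructed from the lesson, so double-check against the notebook; the parameter values are just one setting to try):

from functools import partial
import torch.nn.functional as F
from torch import nn

class GeneralRelu(nn.Module):
    def __init__(self, leak=None, sub=None, maxv=None):
        super().__init__()
        self.leak, self.sub, self.maxv = leak, sub, maxv

    def forward(self, x):
        # leaky slope for negative inputs, optional downward shift, optional ceiling
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        if self.sub is not None: x -= self.sub
        if self.maxv is not None: x.clamp_max_(self.maxv)
        return x

act_gr = partial(GeneralRelu, leak=0.1, sub=0.4)  # one parameter setting to try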
I wrote part 2 of my blog covering LSUV, layer norm, and batch norm. For batch norm, I did a deeper dive into the paper, going over the pseudocode and some of the math.
I also tried a version of layer norm that calculates the means and variances per output feature, the way batch norm does, and it performed better than the original layer norm and batch norm. Did anyone else try this?
I am, if you will, a new student of your course, which I appreciate greatly. So first I really want to thank the team as a whole; it is top work.
Now I have a question: in the SGD class we define the following functions:
def step(self):
    with torch.no_grad():
        for p in self.params:
            self.reg_step(p)
            self.opt_step(p)
    self.i += 1

def opt_step(self, p): p -= p.grad * self.lr

def reg_step(self, p):
    if self.wd != 0: p *= 1 - self.lr*self.wd
I might be wrong, but I think we should remove the learning rate from reg_step, as it will be multiplied in the optimization step that comes next.
Hello,
I hadn’t thought about that when I went through the course, but I think you are right.
Have you tried training models without the learning rate in reg_step?
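To make the two variants concrete, here is the algebra of a single step on a toy parameter (just arithmetic, not a claim about which variant is right):

import torch

p, grad, lr, wd = torch.tensor(1.0), torch.tensor(0.5), 0.1, 0.01

# current code: reg_step scales by (1 - lr*wd), then opt_step subtracts lr*grad
with_lr = p*(1 - lr*wd) - lr*grad    # p - lr*(wd*p + grad), i.e. classic L2-style weight decay -> 0.9490

# proposed change: drop lr from reg_step, so the decay is no longer scaled by the learning rate
without_lr = p*(1 - wd) - lr*grad    # -> 0.9400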