Lesson 17 official topic

This is a wiki post - feel free to edit to add links from the lesson or other useful info.

<<< Lesson 16 | Lesson 18 >>>

Lesson resources

Links from the lesson

15 Likes

One simple thing that I would like to do, but haven't had the time for yet (swamped at work and fallen behind with the lessons), but maybe somebody else can do it while I try to catch up: change the plots for the means and variances of the different layers to use a gradient colour map rather than the default matplotlib one. In my mind this might help in understanding at a glance what is going on (because close layers will have similar colours). A sketch of the idea is below.
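A quick sketch of what I mean, assuming the per-layer stats are already collected in a list (the data below is fake, just to show the colouring):

import matplotlib.pyplot as plt
import numpy as np

stats = [np.random.randn(100).cumsum() for _ in range(8)]   # fake per-layer means
colors = plt.cm.viridis(np.linspace(0, 1, len(stats)))      # gradient colormap: nearby layers get similar colours
for layer_means, c in zip(stats, colors):
    plt.plot(layer_means, color=c)
plt.show()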

1 Like

How do you change the loss function inside a callback? I tried using a lambda *args, **kwargs: self.learn.loss_func(*args, **kwargs) + (something_else), but I got a recursion error (RecursionError: maximum recursion depth exceeded).
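A likely cause, sketched below with a hypothetical stripped-down Learner (not the course's): if the lambda is assigned back to learn.loss_func and looks the attribute up at call time, it ends up calling itself. Capturing the original function first avoids the recursion:

import torch.nn.functional as F

class Learner:                      # hypothetical minimal stand-in
    def __init__(self): self.loss_func = F.cross_entropy

learn = Learner()

# Broken: by the time the lambda runs, learn.loss_func *is* the lambda,
# so it calls itself forever -> RecursionError.
# learn.loss_func = lambda *a, **kw: learn.loss_func(*a, **kw) + 0.1

# Working: capture the original loss function first, then wrap it.
orig_loss = learn.loss_func
learn.loss_func = lambda *a, **kw: orig_loss(*a, **kw) + 0.1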

Can you call the _norm function a z-score?

That’s a good idea! (I assume you mean for the plot_stats() method?)

1 Like

Exactly!

After this lesson I started to care about the mean and std of my tensors much more. So I gave lovely-tensors a try; it makes tensors look much better in Jupyter by showing the mean and stddev by default.

Here is what inspection of ads from 09_learner looks like.

And here is an example weight from our conv network after training in 11_initializing.ipynb. Note that you need to_cpu to show the tensor; otherwise it won't display when using lovely-tensors.

It makes quick introspection faster to type. E.g. here is how easy it was to test whether reset_parameters (which initialises Conv2d weights with kaiming_uniform) is called on our conv network.
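For anyone who wants to try it, a minimal sketch (assuming pip install lovely-tensors):

import torch
import lovely_tensors as lt
lt.monkey_patch()      # tensor __repr__ now shows shape, mean, std, range, ...

t = torch.randn(64, 3, 28, 28)
print(t)               # e.g. tensor[64, 3, 28, 28] n=150528 x∈[...] μ≈0 σ≈1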

Which made me wonder why kaiming_uniform is used instead of the normal variant. It seems that uniform is a bit better in benchmarks, but normal was used by the original ResNet paper. Here is a bit of info on the matter: neural network - When to use (He or Glorot) normal initialization over uniform init? And what are its effects with Batch Normalization? - Data Science Stack Exchange

But it makes me wonder why we saw such drastic improvements after initialising with kaiming_normal when we were already using kaiming_uniform? Any ideas?

7 Likes

I've found the reason why it works worse than ours: it is most likely due to the use of a=sqrt(5), which might be appropriate for leaky_relu but not for regular relu. The value there is for backward compatibility with torch7, as per #15314:
soumith:

but it wanted to have the same end-result in weights, because we wanted to preserve backward-compatibility. So the sqrt(5) is nothing more than giving the code the same end-result as before.
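For context, PyTorch's Conv2d.reset_parameters calls init.kaiming_uniform_(self.weight, a=math.sqrt(5)). A rough sketch of the workaround, assuming plain ReLU activations (the model below is just an example, not the course's):

import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU())

# Re-initialise the conv weights with the gain appropriate for plain ReLU.
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight)
        if m.bias is not None: nn.init.zeros_(m.bias)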

What kind of error do you see?

So, after seeing the GeneralReLU today, I had a brilliant idea that I’m sure many other people had before me.

If we have this leak hyper-parameter, why don’t we make it part of the training and let the optimizer find the proper value?

import torch
from torch import nn

class PGeneralRelu(nn.Module):
    def __init__(self, leak=0., sub=0., maxv=None):
        super().__init__()
        self.leak = nn.Parameter(torch.tensor(leak))  # learnable negative slope
        self.sub = sub
        self.maxv = maxv

    def forward(self, x):
        # x = F.leaky_relu(x,self.leak) if self.leak is not None else F.relu(x)
        x = torch.max(x * self.leak, x)   # leaky relu with trained slope
        x.sub_(self.sub)
        if self.maxv is not None: x.clamp_max_(self.maxv)
        return x

And it indeed does work and finds some values for leak:

for m in model.modules():
    if isinstance(m, PGeneralRelu): print(m.leak.item())
0.04131656885147095
0.13582824170589447
-0.10154463350772858
-0.16783204674720764

What’s interesting, for the last 2 layers, the leak is actually negative, so the function looks like this:
[image: plot of the learned activation with a negative leak]

But of course, the reason we never hear about this idea is that it does not seem to work. The results are close to the normal GeneralReLU(0.1, 0.4), but never quite reach it.

I've tried a variant with a learnable sub too, but it's very unstable if the layer before the non-linearity has a bias term, as the sub and the bias end up fighting and growing in opposite directions until numeric instability kicks in.

Any idea why it does not seem to improve the results, and also, why the last 2 layers learn a non-monotonic activation?

4 Likes

You should check out the old Part 2 lesson 2; there's a notebook dedicated to "Why sqrt(5)" :wink:

5 Likes

Very nice idea, and yes it’s been done! :smiley:

https://pytorch.org/docs/stable/generated/torch.nn.PReLU.html
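For reference, a tiny sketch of how it maps onto the idea above (init sets the starting slope, which is then learned like any other parameter):

import torch
from torch import nn

act = nn.PReLU(init=0.1)   # one learnable negative slope, shared across channels
x = torch.randn(4, 8)
y = act(x)
print(act.weight)          # the slope the optimizer will update during training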

2 Likes

:+1: Haha, did not know it was right on the surface! :slight_smile:

Thank you!!! I'm puzzled how I missed this lesson; I recall a Twitter conversation about improving the default conv init. I watched the video again today, and either I hadn't watched it before, or the reason it disappeared from my memory is that it did not show the jump in accuracy. We were just seeing how the signal propagates through the network.

This lesson, on the other hand, was brilliant, as Jeremy showed the impact experimentally and compared it to a few other initialisations.

BTW, the lesson mentions a bug that the PyTorch team opened to solve this issue. It seems it is still being processed, as they are trying to figure out a good way to keep PyTorch backward compatible. #18182 has been open since 2019 :slight_smile:

2 Likes

It seems that I'm not the only one who was ignoring the connection between the activation function and the initialisation. I've checked a few papers; all of them use He/Kaiming, none explaining the a param they used, nor the distribution. A recent one from 2022 has code attached, and either I'm missing something or they are using the broken PyTorch Conv2d initialisation.
Here is the source code and paper

The paper is otherwise interesting, as it lists activation functions. I've gone through the ReLU section, and the approach that Jeremy proposed is not mentioned.
Jeremy's idea is a mix between LeakyReLU and Flexible ReLU, FReLU(x) = ReLU(x) + b (where b is learnable). arxiv.

@xl0 Flexible ReLU is the one with a parametrised sub, so maybe they have some discussion on bias (they used He/Kaiming for init, without details).

@xl0 There is VReLU, which has a similar shape to the one your activation learned, but it is gated behind an IEEE paywall, and the abstract does not mention the rationale.

The only place I've found some insight into what is happening when you train PReLU was on Papers with Code: PReLU Explained | Papers With Code.

2 Likes

Pylance goes crazy when it sees fastcore. The most problematic thing for it is store_attr, so after that the code is littered with warnings and errors.

Here is an example:

This is kind of problematic, as you lose all the new and shiny hints about data types. In Python 3.11 there is a fix for that, at least for libraries like SQLAlchemy. A workaround I have right now is converting Learner to a @dataclass, so __init__ is built for me and Pylance is able to cope. I'm playing around with this and with the way we keep local scope. I will share once I have something that has 0 Pylance reports.
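A rough sketch of what I mean (a hypothetical, stripped-down Learner, not miniai's actual one): with a dataclass the generated __init__ has typed fields, so Pylance can resolve the attributes instead of guessing what fc.store_attr() assigned.

from dataclasses import dataclass, field
from typing import Any, Callable, Optional

@dataclass
class Learner:
    model: Any
    dls: Any
    loss_func: Optional[Callable] = None
    lr: float = 0.1
    cbs: list = field(default_factory=list)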

1 Like

Yeah generally static-based approaches don’t work well with fast.ai’s heavily dynamic coding style. Our coding style is really designed for working in notebooks, where the full information about runtime types is available for introspection.

1 Like

What is the self.i variable for in the SGD code?

#course22p2/nbs/12_accel_sgd.ipynb
import torch
import fastcore.all as fc

class SGD:
    def __init__(self, params, lr, wd=0.):
        params = list(params)
        fc.store_attr()
        self.i = 0

    def step(self):
        with torch.no_grad():
            for p in self.params:
                self.reg_step(p)
                self.opt_step(p)
        self.i +=1

    def opt_step(self, p): p -= p.grad * self.lr
    def reg_step(self, p):
        if self.wd != 0: p *= 1 - self.lr*self.wd

    def zero_grad(self):
        for p in self.params: p.grad.data.zero_()

It is counting the number of steps, but I can’t see it being used.

It’s used in a subclass (Adam).
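For anyone curious, a minimal sketch of how a subclass building on the SGD class above can use the step counter for Adam-style bias correction (roughly following the course notebook; details may differ):

class Adam(SGD):
    def __init__(self, params, lr, wd=0., beta1=0.9, beta2=0.99, eps=1e-5):
        super().__init__(params, lr=lr, wd=wd)
        self.beta1, self.beta2, self.eps = beta1, beta2, eps

    def opt_step(self, p):
        if not hasattr(p, 'avg'): p.avg = torch.zeros_like(p.grad.data)
        if not hasattr(p, 'sqr_avg'): p.sqr_avg = torch.zeros_like(p.grad.data)
        p.avg = self.beta1*p.avg + (1-self.beta1)*p.grad
        p.sqr_avg = self.beta2*p.sqr_avg + (1-self.beta2)*(p.grad**2)
        # self.i counts completed steps, so the current step number is self.i + 1;
        # dividing by (1 - beta**step) corrects the zero-initialisation bias.
        unbias_avg = p.avg / (1 - self.beta1**(self.i+1))
        unbias_sqr_avg = p.sqr_avg / (1 - self.beta2**(self.i+1))
        p -= self.lr * unbias_avg / (unbias_sqr_avg.sqrt() + self.eps)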