Lesson 11 discussion and wiki

They are indeed identical: https://pytorch.org/docs/stable/torch.html#torch.addcdiv

p + (-lr / debias1) * grad_avg / ((sqr_avg/debias2).sqrt() + eps)

p + -lr * (grad_avg / debias1) / ((sqr_avg/debias2).sqrt() + eps)

Is there a pytorch guide that explains when to use this kind of do-many-things op, as compared to just doing it the "normal" way, like in the expanded version above? I suppose the expanded version would be slower.
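For anyone who wants to convince themselves, here is a minimal standalone sketch (the tensor names are made up, not the notebook's actual optimizer state) showing that the fused op and the expanded expression agree:

import torch

# dummy state, just for the comparison
p        = torch.randn(10)
grad_avg = torch.randn(10)
sqr_avg  = torch.rand(10) + 0.1
lr, eps, debias1, debias2 = 1e-3, 1e-8, 0.9, 0.99

expanded = p + (-lr / debias1) * grad_avg / ((sqr_avg/debias2).sqrt() + eps)
fused    = torch.addcdiv(p, grad_avg, (sqr_avg/debias2).sqrt() + eps, value=-lr/debias1)

print(torch.allclose(expanded, fused))  # True, up to floating-point rounding

As for speed, the fused op can avoid materializing some intermediate tensors, so I'd expect the expanded version to be a bit slower, but I haven't seen an official guide on when to prefer one over the other.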

Some questions that came up while going through the 09_optimizers.ipynb notebook:

For SGD with weight decay we use steppers = [weight_decay, sgd_step], so we run weight_decay first and then pass its output through sgd_step. However, in the notebook the full wd formula is shown as:
new_weight = weight - lr * weight.grad - lr * wd * weight
Therefore I wonder whether the sequential calculation changes the weight to the "decayed weights", which are then used for the sgd_step, so that we end up with something similar to, but not exactly, what the formula describes.

I tried to test the difference by combining

def weight_decay(p, lr, wd, **kwargs):
    p.data.mul_(1 - lr*wd)
    return p

def sgd_step(p, lr, **kwargs):
    p.data.add_(-lr, p.grad.data)
    return p

into a single (not really optimized) stepper

def sgd_weight_decay(p, lr, wd, **kwargs):
    sgd = -lr * p.grad.data
    wd = -lr * wd * p.data
    p.data.add_(sgd).add_(wd)
    return p

and ran both 3x in the notebook with 10 epochs, but no real difference was visible in the losses/accuracies.
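Since this is easy to check outside the training loop, here is a small standalone sketch (plain tensors, not the notebook's Learner) confirming that [weight_decay, sgd_step] and the combined stepper give the same result, because weight_decay only rescales p and leaves p.grad untouched:

import torch

lr, wd = 0.1, 0.01
torch.manual_seed(0)

p1 = torch.randn(5, requires_grad=True); p1.grad = torch.randn(5)
p2 = p1.detach().clone().requires_grad_(); p2.grad = p1.grad.clone()

# sequential: weight_decay then sgd_step
p1.data.mul_(1 - lr*wd)
p1.data.add_(p1.grad.data, alpha=-lr)

# combined: new_weight = weight - lr*grad - lr*wd*weight
sgd_term = -lr * p2.grad.data
wd_term  = -lr * wd * p2.data     # p2.data is still the original weight here
p2.data.add_(sgd_term).add_(wd_term)

print(torch.allclose(p1.data, p2.data))  # True: the two formulations are identical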

If we update with sgd first and then wd (that is, [sgd_step, weight_decay]), the result must be different, because we then use the sgd-updated weights to calculate the wd term. However, the difference is probably within the random variation of SGD.
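To put a number on it, here is a quick standalone check (plain tensors again) showing that the swapped order only differs by an extra lr*lr*wd*grad term:

import torch

lr, wd = 0.1, 0.01
p, g = torch.randn(5), torch.randn(5)

wd_then_sgd = p*(1 - lr*wd) - lr*g          # [weight_decay, sgd_step]
sgd_then_wd = (p - lr*g)*(1 - lr*wd)        # [sgd_step, weight_decay]

print(sgd_then_wd - wd_then_sgd)            # matches lr*lr*wd*g up to floating-point rounding
print(lr*lr*wd*g)                           # ...which is tiny for typical lr/wd values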

One of my favorite parts of the fast.ai courses is that we are taught the latest state-of-the-art methods.

I wonder for each new technique, does it just improve convergence, or does it improve the final result? Or both?

For example, should we expect a new optimizer to improve the final results? What about running batch norm?

Is it just a matter of reading the papers as well as trying things out?

Thanks

I had the same idea but it didn't work for me. When I run:
%time run.fit(2, learn)
after the weight initialization, sometimes it works well, but half of the time the loss either explodes or vanishes.
Did you attempt to train the model after the initialization? How did that work for you?

Bah! You’re right, @Kagan! It doesn’t train!

Indeed I only tested the stats and assumed the rest would just work! Thank you for checking that my suggestion was wrong!

I tried to re-normalize the stats part again, so std+bias+std, but no, that doesn’t work either.

Re-running lsuv_module twice destroys training too!

for m in mods: print(lsuv_module(m, xb), lsuv_module(m, xb))

Does anybody have an idea why? It looks like it's very important that the bias is not zero-centered!

It was zeroed in their original pytorch impl anyway (github). In my experiments zero init was as good as the fast.ai-style one.

Would it be possible to have a look at your full code/telemetry? Curious if we ran into the same thing

Had a similar symptom in about half of the trainings.

Fixed by reducing the LR from 1 to 0.4; here's how that looked on the telemetry:

  1. early blow-up (img)
  2. with leftover variation (img)

It’s just 07a_lsuv.ipynb as is.

Here is another variant I tried, attempting to balance the adjustments, while having std=1, mean=0 post-lsuv_module:

def lsuv_module(m, xb):
    h = Hook(m, append_stat)
    mdl(xb)                    # one forward pass so h.mean/h.std exist before the first check
    mean,std = h.mean,h.std

    while mdl(xb) is not None and (abs(mean) > 1e-3 or abs(std-1) > 1e-3):
        mean,std = h.mean,h.std
        if abs(mean)  > 1e-3: m.bias -= mean
        if abs(std-1) > 1e-3: m.weight.data /= std

    h.remove()
    return h.mean,h.std

(note: it recalculates mean/std twice, but I didn’t bother refactoring, since it’s just a proof of concept.)

It works better (i.e. it trains), but I'm still getting NaNs every so often. This is with the default lr=0.6 of that nb.

But, of course, the original nb gets NaNs every so often too, so that learning rate is just too high.

With a lower lr=0.1 the original reversed-order std+bias approach trains just fine too.

Here is a refactored “balanced” version:

def lsuv_module(m, xb):
    h = Hook(m, append_stat)

    while mdl(xb) is not None:
        mean,std = h.mean, h.std
        if abs(mean) > 1e-3 or abs(std-1) > 1e-3 :
            m.bias -= mean
            m.weight.data.div_(std)
        else: break
                
    h.remove()
    return h.mean,h.std

Perhaps self.sub in GeneralReLU needs to be a parameter; then the init would only affect its initial value and the network could tune it up from there (rough sketch below).
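Here is a rough, untested sketch of that idea, assuming the course's GeneralRelu interface (leak/sub/maxv); the class name is mine:

import torch
from torch import nn
import torch.nn.functional as F

class GeneralReluLearnableSub(nn.Module):
    "Like GeneralRelu, but `sub` is an nn.Parameter: init/lsuv sets its starting value, training tunes it."
    def __init__(self, leak=None, sub=0., maxv=None):
        super().__init__()
        self.leak, self.maxv = leak, maxv
        self.sub = nn.Parameter(torch.tensor(float(sub)))

    def forward(self, x):
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        x = x - self.sub                      # learnable shift (out-of-place, unlike the original x.sub_)
        if self.maxv is not None: x = x.clamp_max(self.maxv)
        return x

If I remember the nb right, ConvLayer.bias is backed by relu.sub, so the lsuv code that does m.bias -= mean would need a small tweak (e.g. operating on self.sub.data) once sub is a Parameter.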

Thank you for checking this out. It's good to know that we can mitigate the explosion/vanishing by reducing the learning rate.
It's still kind of weird that this is happening, though; I'd expect lsuv std+bias to perform the best since it has the "perfect statistics".
I thought it might be related to the distribution of the weights after the initialization (having 0 mean, 1 std, but being skewed), so I plotted a histogram after the initialization:
for m in mods: plt.figure(); plt.hist(m.weight.view(-1).detach().cpu(), bins=20)
but it looks pretty much the same when the model trains normally and when it goes to NaN.

I played some more with 07a_lsuv nb. Here are some observations/notes:

  1. The sub argument shouldn't be configurable, as it gets reset to a value relative to the batch's mean regardless of its initial value. (Unless it's meant to be used in some other way w/o lsuv, but it'd be very difficult to choose manually, as it varies from layer to layer with lsuv.)

    To prove that it doesn't need to be configurable, fix the seed and re-run the nb once with sub set to 0 and then to 50, adding its value to the return list. After lsuv_module is run, m.relu.sub ends up being exactly the same value, regardless of its initial value:

class ConvLayer(nn.Module):
    def __init__(self, ni, nf, ks=3, stride=2, sub=50., **kwargs):
                                               ^^^^^^^ 
    [...]
    
def lsuv_module(m, xb):
    [...]
    return m.relu.sub, h.mean, h.std
           ^^^^^^^^^^
  2. Making sub a parameter didn't lead to improvements; it made things worse in my experiments. The value of sub seems to be a very sensitive one.

  3. This implementation of lsuv doesn't check whether the variance is tiny (no eps) or undefined (small bs with no variance) before dividing by it. It tests with bs=512, which won't have any of these issues, but that is far from the general case (a hedged eps-guard sketch follows below).

    Using bs=2 requires a much, much lower lr.

  4. While experimenting I used a fixed (reproducible) seed, so it was easy to analyse more closely the cases where the network wasn't training (and to turn different parts on/off). Most of the time lsuv seemed to be the culprit - so it is helpful in general, but it also leads to NaNs at times at the lr used in the nb.

Also note that the original LSUV doesn’t tweak the mean, only the std. But w/o the mean tweak in the lesson nb, things don’t perform as well. So this is a bonus. And the nb version doesn’t implement the optional orthonormal init.
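Re point 3 above, here is a hedged sketch of what an eps guard could look like. It reuses the nb's Hook/append_stat/mdl, and the eps/tol names and values are just mine:

def lsuv_module_safe(m, xb, eps=1e-6, tol=1e-3):
    h = Hook(m, append_stat)
    while mdl(xb) is not None:
        mean, std = h.mean, h.std
        if abs(mean) <= tol and abs(std-1) <= tol: break
        m.bias -= mean
        m.weight.data.div_(std + eps)   # eps keeps the divisor away from zero for tiny batches
    h.remove()
    return h.mean, h.std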

Hi,

According to 10_augmentation.ipynb, using numpy is supposed to result in faster tensor creation, but that’s clearly not happening for me. Not sure what could be going on; I repeated this test with 2 more images:

That’s interesting. I guess it might depend on the Pillow, pytorch, and numpy versions…
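In case it helps to compare apples to apples, here is a rough standalone way to time the two paths on your own setup (the image path is a placeholder, and this isn't necessarily the exact comparison the notebook does):

from PIL import Image
import numpy as np, torch, timeit

img = Image.open('path/to/some_image.jpg').convert('RGB')   # placeholder path

via_numpy  = lambda: torch.from_numpy(np.array(img))         # PIL -> numpy -> tensor
via_python = lambda: torch.tensor(list(img.getdata()))       # PIL -> python list -> tensor

print('via numpy :', timeit.timeit(via_numpy,  number=50))
print('via python:', timeit.timeit(via_python, number=50))

The absolute numbers will of course depend on the image size and the library versions mentioned above.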

Just a quick, really pedantic, domain-specific comment on RandomResizeCrop, which, from the lesson, it sounds like fastai is looking to replace with a very smart perspective "warp" transform.

Jeremy made the comment that objects are never wider or thinner in the real world and that RandomResizeCrop is perhaps trying to account for changes in perspective.

Knowing the behavior of camera lenses, RandomResizeCrop still serves a purpose. For example, some lenses will make faces in portraits (i.e. objects) appear wider, while others will make them appear thinner.

See this site as a first Google search reference on the topic: https://mcpactions.com/2014/05/19/perfect-portrait-lens/

In short, I think a combination of both could be useful.

From my understanding of lens distortion, that's not quite true - lenses don't squish an entire picture just horizontally or vertically. Or at least not enough to make the massive squishing of the imagenet transforms sensible.

In the code of this lesson 11, is there any dropout being applied by pytorch underneath during training, or does it have to be explicitly called for? Thank you :wink:
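Not an authoritative answer, but a quick way to see the general behaviour: PyTorch only applies dropout where you explicitly add an nn.Dropout module (or call F.dropout), and it is only active in train() mode:

import torch
from torch import nn

m = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

m.train(); print(m(x))   # some outputs randomly zeroed, the rest scaled by 1/(1-p)
m.eval();  print(m(x))   # dropout is a no-op at eval time

So unless a model explicitly contains dropout layers, none is applied underneath.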

Here is something I’m confused about:

At around 01:22:00 in the lesson, talking about momentum, Jeremy says that if you have 10 million activations, you need to store 10 million floats. Did he mean weights instead of activations? Because a bit before that he called them parameters (which, as I understand, are the same as weights), and a bit before that he also called them activations (which are very different).

Thanks.
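For what it's worth, here is a tiny standalone illustration of why the extra storage scales with the number of parameters: classic SGD momentum keeps one float per weight/bias.

import torch
from torch import nn

model = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))

n_params = sum(p.numel() for p in model.parameters())
momentum_bufs = [torch.zeros_like(p) for p in model.parameters()]   # the optimizer's extra state
n_state = sum(b.numel() for b in momentum_bufs)

print(n_params, n_state)   # identical counts: one stored float per parameter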

Probably I have missed something, but I do not understand why we choose kernel size 3 instead of 5 for the first layer. If that is something proven by research, can you explain to me why Jeremy has chosen the number of channels for the first layer to be c_in x 3 x 3? I did not understand that part either :blush:

Actually I think it is in this other "Bag of Tricks" paper (somewhat inflationary use of the term ;-)):

Page 5:

The observation is that the computational cost of a convolution is quadratic to the kernel width or height. A 7 × 7 convolution is 5.4 times more expensive than a 3 × 3 convolution. So this tweak replacing the 7 × 7 convolution in the input stem with three conservative 3 × 3 convolutions […]
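For concreteness, here is a sketch of the stem tweak the quote describes; the channel counts 3 -> 32 -> 32 -> 64 follow my reading of the paper's ResNet-C description:

from torch import nn

# single 7x7 stride-2 conv (classic ResNet input stem)
stem_7x7 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)

# three consecutive 3x3 convs doing the same downsampling (ResNet-C style)
stem_3x3 = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
)

print(7*7 / (3*3))   # ~5.4: the per-position cost ratio the quote refers to

As for c_in x 3 x 3 in the lesson: with a 3-channel input and a 3x3 kernel, each output position of the first layer only sees 3*3*3 = 27 input values, so (as I understood it) there is little point in making that layer's output much wider than that.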

Oops I copied from Google without checking - deleted my reply so others won’t be confused.

Dear @jeremy, thank you sooo much for that moment: :100:

" we show that L2 reg has no regularisation effect…WHAT???:open_mouth: … "
" you know how I keep mentioning how none of us know what they are doing… that should make you feel better about ‘can I contribute to DL’ …? "

It does. I love your way of teaching, encouraging us, knocking down the barriers of entry to deep learning and gifting us all these tips and tools, but THAT WAS BY FAR THE BEST, MOST ENCOURAGING MOMENT so far for me. I was just thinking “my head is about to explode with all this info, I need a walk, fresh air” and boom you dropped the mic.

Thanks again and please keep it coming,
Lamine Gaye

PS: my 1st post on the forums… I just had to say this
