Lesson 11 discussion and wiki

I had the same idea, but it didn’t work for me. When I run:
%time run.fit(2, learn)
after the weight initialization, it sometimes works well, but half of the time the loss either explodes or vanishes.
Did you attempt to train the model after the initialization? How did that work for you?

Bah! You’re right, @Kagan! It doesn’t train!

Indeed I only tested the stats and assumed the rest would just work! Thank you for checking that my suggestion was wrong!

I tried to re-normalize the stats part again, so std+bias+std, but no, that doesn’t work either.

Re-running lsuv_module twice destroys training too!

for m in mods: print(lsuv_module(m, xb), lsuv_module(m, xb))

Does anybody have an idea why? It looks like it’s very important that the bias is not zero-centered!


It was zeroed in their original pytorch impl anyway (github). In my experiments zero init was as good as fast.ai-style.

Would it be possible to have a look at your full code/telemetry? Curious if we ran into the same thing

Had a similar symptom in half of the training runs.

Fixed by reducing LR from 1 to 0.4, here’s how that looked on telemetry

  1. early blow-up (img)
  2. with leftover variation(img)

It’s just 07a_lsuv.ipynb as is.

Here is another variant I tried, attempting to balance the adjustments, while having std=1, mean=0 post-lsuv_module:

def lsuv_module(m, xb):
    h = Hook(m, append_stat)

    # Run one forward pass so the hook has stats before the first check
    mdl(xb)
    mean,std = h.mean,h.std
    while mdl(xb) is not None and (abs(mean) > 1e-3 or abs(std-1) > 1e-3):
        mean,std = h.mean,h.std
        if abs(mean)  > 1e-3: m.bias -= mean
        if abs(std-1) > 1e-3: m.weight.data /= std
    return h.mean,h.std

(note: it recalculates mean/std twice, but I didn’t bother refactoring, since it’s just a proof of concept.)

It works better (i.e. trains), but I am still getting NaNs every so often. This is with the default lr=0.6 of that nb.

But, of course, the original nb gets nans too every so often, so that learning rate is just too high.

With a lower lr=0.1, the original reversed-order std+bias approach trains just fine too.

Here is a refactored “balanced” version:

def lsuv_module(m, xb):
    h = Hook(m, append_stat)

    while mdl(xb) is not None:
        mean,std = h.mean, h.std
        if abs(mean) > 1e-3 or abs(std-1) > 1e-3 :
            # adjust both together, as in the variant above
            m.bias -= mean
            m.weight.data /= std
        else: break
    return h.mean,h.std

Perhaps self.sub in GeneralReLU needs to be a parameter and then the init will only affect the initial setting and then let the network tune it up.


Thank you for checking this out. It’s good to know that we can mitigate the explosion/vanishing by reducing the learning rate.
It’s still kind of weird that this is happening though; I’d expect lsuv std+bias to perform the best since it has the “perfect statistics”.
I thought that it might be related to the distribution of the weights after the initialization (having 0 mean, 1 std, but being skewed), so I plotted a histogram after the initialization
for m in mods: plt.figure(); plt.hist(m.weight.view(-1).detach().cpu(), bins = 20)
but it looks pretty much the same when the model trains normally and when it goes to NaN.
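
One way to test the skew hypothesis numerically rather than by eye is to compute the sample skewness of each layer's weights. A minimal sketch in plain Python, assuming the weights have already been flattened into a list of floats; the `skewness` helper is hypothetical and not part of the notebook:

```python
import math

def skewness(values):
    """Sample skewness: mean of cubed z-scores (about 0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # population variance
    std = math.sqrt(var)
    if std == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return sum(((v - mean) / std) ** 3 for v in values) / n

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [0.0, 0.0, 0.0, 0.0, 10.0]
print(skewness(symmetric))     # close to zero
print(skewness(right_skewed))  # positive: long right tail
```

With a real layer, something like `m.weight.detach().view(-1).tolist()` should produce a suitable list, so the value can be compared between runs that train and runs that go to NaN.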

I played some more with 07a_lsuv nb. Here are some observations/notes:

  1. The sub argument shouldn’t be configurable, as it gets reset to a value relative to the batch’s mean regardless of its initial value. (Unless it’s meant to be used in some other way without lsuv, but it’d be very difficult to choose manually, as it varies from layer to layer with lsuv.)

    To prove that it doesn’t need to be configurable, fix the seed and re-run the nb once with sub set to 0 and then to 50, adding its value to the return list - after lsuv_module is run - m.relu.sub ends up being exactly the same value, regardless of its initial value.

class ConvLayer(nn.Module):
    def __init__(self, ni, nf, ks=3, stride=2, sub=50., **kwargs):
        # [...]
def lsuv_module(m, xb):
    # [...]
    return m.relu.sub, h.mean, h.std

  2. Making sub a parameter didn’t lead to improvements, but made things worse in my experiments. The value of sub seems to be a very sensitive one.

  3. This implementation of lsuv doesn’t check whether variance is tiny (no eps) or undefined (small bs w/ no variance) before dividing by it - it tests with bs=512, which won’t have any of these issues, but that is far from the general case.

    Using bs=2 requires a much, much lower lr.

  4. While experimenting I used a random reproducible seed, so it was helpful to analyse closer the cases where the network wasn’t training (so that I could turn different parts on/off). Most of the time lsuv seemed to be the culprit - so it is helpful in general, but also leads to nans at times at the lr used in the nb.
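
The missing variance check mentioned above could look roughly like the following. This is a sketch in plain Python rather than the notebook's code: `safe_scale` and `eps` are hypothetical names, and the activations are assumed to be a flat list of floats.

```python
import statistics

def safe_scale(activations, eps=1e-6):
    """Return the divisor an LSUV-style step would apply to the weights.

    Falls back to 1.0 (no rescaling) when the std is undefined or tiny.
    """
    if len(activations) < 2:
        return 1.0  # sample std is undefined for fewer than two values
    std = statistics.stdev(activations)
    if std < eps:
        return 1.0  # near-constant activations: dividing would blow up the weights
    return std

print(safe_scale([0.5]))            # single value, no rescaling
print(safe_scale([1.0, 1.0, 1.0]))  # zero variance, no rescaling
print(safe_scale([0.0, 2.0]))       # ordinary case, returns the sample std
```

Falling back to "no rescaling" is only one possible policy; clamping the divisor to `eps` or skipping the layer entirely would be equally defensible choices.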

Also note that the original LSUV doesn’t tweak the mean, only the std. But w/o the mean tweak in the lesson nb, things don’t perform as well. So this is a bonus. And the nb version doesn’t implement the optional orthonormal init.



According to 10_augmentation.ipynb, using numpy is supposed to result in faster tensor creation, but that’s clearly not happening for me. Not sure what could be going on; I repeated this test with 2 more images:

That’s interesting. I guess it might depend on the Pillow, pytorch, and numpy versions…

Just a quick, really pedantic, domain-specific comment on RandomResizeCrop, which, from the lesson, it sounds like fastai is looking to replace with a very smart perspective “warp” transform.

Jeremy made comments that objects are never wider or thinner in the real world and that RandomResizeCrop is perhaps trying to account for changes in perspective.

Knowing the behavior of camera lenses, RandomResizeCrop still serves a purpose. For example, some lenses will make faces in portraits (i.e. objects) appear wider, while others will make them appear thinner.

See this site, the first Google search result on the topic: https://mcpactions.com/2014/05/19/perfect-portrait-lens/

In short, I think a combination of both could be useful.

From my understanding of lens distortion that’s not quite true - they don’t squish an entire picture just horizontally or vertically. Or at least not enough to make the massive squishing of imagenet transforms sensible.

In the code of this lesson 11, is there any dropout being applied by pytorch underneath during training? Or does it have to be explicitly called for? Thank you :wink:

Here is something I’m confused about:

At around 01:22:00 in the lesson, talking about momentum, Jeremy says that if you have 10 million activations, you need to store 10 million floats. Did he mean weights instead of activations? Because a bit before that he called them parameters (which, as I understand, are the same as weights), and a bit before that he also called them activations (which are very different).
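
For reference, plain SGD with momentum keeps one running value per parameter being updated, so the extra storage scales with the number of weights. A minimal sketch, assuming flat lists of floats and the undampened form of momentum; the names `sgd_momentum_step` and `buf` are hypothetical, and the lesson's optimizer may use a different variant:

```python
def sgd_momentum_step(params, grads, buf, lr=0.1, mom=0.9):
    """One in-place SGD-with-momentum update over flat lists of floats."""
    for i, g in enumerate(grads):
        buf[i] = mom * buf[i] + g   # exponentially weighted running sum of gradients
        params[i] -= lr * buf[i]    # step along the smoothed gradient
    return params, buf

params = [1.0, -2.0, 0.5]
buf = [0.0] * len(params)           # one stored float per parameter
grads = [0.5, -1.0, 0.0]
sgd_momentum_step(params, grads, buf)
print(len(buf) == len(params))
```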


I have probably missed something, but I do not understand why we choose kernel size 3 instead of 5 for the first layer. If that is something proven by research, can you explain why Jeremy has chosen the number of channels for the first layer to be c_in×3×3? I did not understand that part either :blush:

Actually I think it is in this other „bag of tricks“ paper, somewhat inflationary use of the term ;-):

Page 5:

The observation is that the computational cost of a convolution is quadratic to the kernel width or height. A 7 × 7 convolution is 5.4 times more expensive than a 3 × 3 convolution. So this tweak replacing the 7 × 7 convolution in the input stem with three conservative 3 × 3 convolutions […]
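
The 5.4 figure follows from counting multiply-adds per output position, which grow with the kernel area. A quick check, assuming equal input and output channel counts for both kernels (a simplification; `conv_cost` is a hypothetical helper and the channel count is arbitrary):

```python
def conv_cost(kernel_size, c_in, c_out):
    """Multiply-adds per output position for a square 2D convolution."""
    return kernel_size * kernel_size * c_in * c_out

c = 64  # hypothetical channel count, used only to make the ratio concrete
ratio = conv_cost(7, c, c) / conv_cost(3, c, c)
print(round(ratio, 2))  # 49 / 9, i.e. 5.44
```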

Oops I copied from Google without checking - deleted my reply so others won’t be confused.

Dear @jeremy, thank you sooo much for that moment: :100:

" we show that L2 reg has no regularisation effect…WHAT???:open_mouth: … "
" you know how I keep mentioning how none of us know what they are doing… that should make you feel better about ‘can I contribute to DL’ …? "

It does. I love your way of teaching, encouraging us, knocking down the barriers of entry to deep learning and gifting us all these tips and tools, but THAT WAS BY FAR THE BEST, MOST ENCOURAGING MOMENT so far for me. I was just thinking “my head is about to explode with all this info, I need a walk, fresh air” and boom you dropped the mic.

Thanks again and please keep them coming,
Lamine Gaye

PS: my 1st post on the forums… I just had to say this


Can someone explain how those two are identical? (the blue lines).

At around 28:00, Jeremy talks about get_files(), and how fast it is. I was intrigued, and decided to try to recreate it locally and experiment. I wanted to start off with a naive version and see what sort of speed ups I could achieve:

def get_fnames(train=False):
    path = Path('/Users/daniel/.fastai/data/imagenette-160')
    if train:
        path = path/'train'
    else:
        path = path/'val'
    fnames = []
    # walk class folders, then the files inside each one
    for _dir in path.ls():
        for fname in _dir.ls():
            fnames.append(fname)

    return fnames
>>> fnames = get_fnames(train=True)
>>> len(fnames)
>>> timeit -n 10 get_fnames(train=True)
34.9 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> fnames[:3]

In the video, he shows ~70ms runtime, whereas I’m getting ~35, about twice as fast. This suggests I’ve misunderstood the task at hand or something … can anyone shed any light on what’s going on? Are we in fact doing the same thing? Why does my timing show a faster speed?
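
Timings are only comparable if both functions do the same work: same directories, same recursion depth, same filtering. As a basis for an apples-to-apples test, here is a sketch of a recursive, extension-filtered listing built on `os.walk`. `list_files` is a hypothetical helper, not the notebook's get_files(), and whether the call in the video covered more than the train folder is something to check in the notebook rather than assume.

```python
import os
import tempfile
from pathlib import Path

def list_files(root, extensions=None):
    """Recursively collect files under `root`, optionally filtered by extension."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # extensions are compared in lower case, e.g. {".jpeg", ".png"}
            if extensions is None or Path(name).suffix.lower() in extensions:
                found.append(Path(dirpath) / name)
    return found

# Small self-contained check on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    class_dir = Path(tmp) / "train" / "cls_a"
    class_dir.mkdir(parents=True)
    (class_dir / "x.JPEG").touch()
    (class_dir / "notes.txt").touch()
    all_files = list_files(tmp)
    images = list_files(tmp, extensions={".jpeg"})

print(len(all_files), len(images))
```

Pointing both this helper and get_fnames at the same folder and timing them with the same `timeit` settings would show whether the gap comes from the amount of work or from the implementation.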

Don’t see this mentioned here in Lesson 11 thread, so adding a link here.

If you get the following errors when running 08_data_block.ipynb

  • cos_1cycle_anneal not defined
  • Runner does not have in_train

Link to forum thread discussing this and the solution - thanks to @exynos7 for spending the time to solve for all of us.

A minor tweak that fixes the LSUV algorithm normalization

Note added: I found after writing this post that @stas independently discovered this a while ago

At the end of the 07a_lsuv.ipynb notebook, the means and stds of each layer are shown after the application of the LSUV algorithm, and we see that the means are not near zero.

There is a comment in the notebook: "Note that the mean doesn’t exactly stay at 0. since we change the standard deviation after by scaling the weight."

However, if in the lsuv_module you first scale the standard deviation, then correct the mean, the problem with the not-near-zero means is solved. This involves switching the order of two lines of code as follows:

Original version:

while mdl(xb) is not None and abs(h.mean) > 1e-3: m.bias -= h.mean
while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std

Modified version:

while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
while mdl(xb) is not None and abs(h.mean) > 1e-3: m.bias -= h.mean

In the next cell, we execute the lsuv initialization on all layers and print the means and standard deviations of each layer's outputs:

for m in mods: print(lsuv_module(m, xb))

Here is the output with the modified code:
(-2.3492136236313854e-08, 0.9999998807907104)
(2.5848951867857295e-09, 1.0)
(-1.7811544239521027e-08, 0.9999998807907104)
(9.778887033462524e-09, 0.9999999403953552)
(-1.30385160446167e-08, 1.0000001192092896)

The normalization is now perfect (i.e., within the specified precision) in both mean and standard deviation. However, fixing the mean to be near zero did not improve model accuracy.
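
Why the order matters can be seen on a toy layer: subtracting a constant does not change the standard deviation, whereas rescaling the weight does change the mean. The sketch below assumes a scalar weight, a shifted ReLU of the form ReLU(w*x) - sub, and a single pass for each correction. It illustrates the effect in plain Python and is not the notebook's lsuv_module.

```python
import random
import statistics

def stats(xs, w, sub):
    """Mean and std of a toy layer: ReLU(w * x) - sub."""
    ys = [max(0.0, w * x) - sub for x in xs]
    return statistics.mean(ys), statistics.pstdev(ys)

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(1000)]

# Order A: centre the mean first, then rescale the weight.
w, sub = 1.0, 0.0
sub += stats(xs, w, sub)[0]
w /= stats(xs, w, sub)[1]
mean_a, std_a = stats(xs, w, sub)

# Order B: rescale the weight first, then centre the mean.
w, sub = 1.0, 0.0
w /= stats(xs, w, sub)[1]
sub += stats(xs, w, sub)[0]
mean_b, std_b = stats(xs, w, sub)

print(f"mean-then-std: mean={mean_a:.3f} std={std_a:.3f}")
print(f"std-then-mean: mean={mean_b:.3f} std={std_b:.3f}")
```

In order A the rescaling step moves the mean away from zero again, while in order B the final shift leaves the already-normalized standard deviation untouched.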
