I had the same idea but it didn’t work for me. When I run:
%time run.fit(2, learn)
after the weight initialization, sometimes it works well, but half of the time the loss either explodes or vanishes.
Did you attempt to train the model after the initialization? How did that work for you?
Bah! You’re right, @Kagan! It doesn’t train!
Indeed, I only tested the stats and assumed the rest would just work! Thank you for checking that my suggestion was wrong!
I tried to renormalize the stats again, in the order std+bias+std, but no, that doesn’t work either.
Rerunning lsuv_module twice destroys training too:
for m in mods: print(lsuv_module(m, xb), lsuv_module(m, xb))
Does anybody have an idea why? It looks like it’s very important that the bias is not zero-centered!
It was zeroed in their original pytorch impl anyway (github). In my experiments zero init was as good as the fast.ai-style one.
Would it be possible to have a look at your full code/telemetry? Curious if we ran into the same thing.
Had a similar symptom in about half of the trainings.
Fixed by reducing the LR from 1 to 0.4; here’s how that looked on telemetry:
It’s just 07a_lsuv.ipynb as is.
Here is another variant I tried, attempting to balance the adjustments while ending up with std=1, mean=0 post lsuv_module:
def lsuv_module(m, xb):
    h = Hook(m, append_stat)
    while mdl(xb) is not None and (abs(h.mean) > 1e-3 or abs(h.std-1) > 1e-3):
        mean,std = h.mean,h.std
        if abs(mean)  > 1e-3: m.bias -= mean
        if abs(std-1) > 1e-3: m.weight.data /= std
    h.remove()
    return h.mean,h.std
(note: it recalculates mean/std twice, but I didn’t bother refactoring, since it’s just a proof of concept.)
It works better (i.e. it trains), but I’m still getting nans every so often. This is with the default lr=0.6 of that nb.
But, of course, the original nb gets nans too every so often, so that learning rate is just too high.
With a lower lr=0.1 the original reversed-order std+bias approach trains just fine too.
Here is a refactored “balanced” version:
def lsuv_module(m, xb):
    h = Hook(m, append_stat)
    while mdl(xb) is not None:
        mean,std = h.mean,h.std
        if abs(mean) > 1e-3 or abs(std-1) > 1e-3:
            m.bias -= mean
            m.weight.data.div_(std)
        else: break
    h.remove()
    return h.mean,h.std
Perhaps self.sub in GeneralReLU needs to be a parameter; then the init will only affect the initial setting and the network can tune it up from there.
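To make the suggestion concrete, here is a minimal sketch of what that could look like. This is a hypothetical class I made up (not the lesson’s actual GeneralReLU, and the name is my own); the only point is that registering sub as an nn.Parameter lets lsuv set its initial value and then lets the optimizer keep tuning it:

```python
import torch
from torch import nn
import torch.nn.functional as F

class GeneralReluParam(nn.Module):
    "Sketch: a ReLU variant whose subtracted offset is learnable."
    def __init__(self, sub=0., leak=None):
        super().__init__()
        # nn.Parameter makes sub show up in model.parameters(),
        # so the optimizer updates it after lsuv sets its starting value
        self.sub = nn.Parameter(torch.tensor(float(sub)))
        self.leak = leak

    def forward(self, x):
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        return x - self.sub
```

(As noted below, in my experiments this didn’t actually help, but this is the shape of the idea.)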
Thank you for checking this out. It’s good to know that we can mitigate the explosion/vanishing by reducing the learning rate.
It’s still kind of weird that this is happening though; I’d expect lsuv std+bias to perform the best since it has the “perfect statistics”.
I thought that it might be related to the distribution of the weights after the initialization (having mean 0, std 1, but being skewed), so I plotted a histogram after the initialization:
for m in mods: plt.figure(); plt.hist(m.weight.view(-1).detach().cpu(), bins=20)
but it looks pretty much the same when the model trains normally and when it goes to NaN.
I played some more with 07a_lsuv nb. Here are some observations/notes:

The sub argument shouldn’t be configurable, as it gets reset to a value relative to the batch’s mean regardless of its initial value. (Unless it’s meant to be used some other way w/o lsuv, but it’d be very difficult to choose manually, as it varies from layer to layer with lsuv.) To prove that it doesn’t need to be configurable, fix the seed and rerun the nb once with sub set to 0 and then to 50, adding its value to the return list: after lsuv_module is run, m.relu.sub ends up being exactly the same value, regardless of its initial value.
class ConvLayer(nn.Module):
    def __init__(self, ni, nf, ks=3, stride=2, sub=50., **kwargs):
                                               ^^^^^^^
    [...]

def lsuv_module(m, xb):
    [...]
    return m.relu.sub, h.mean, h.std
           ^^^^^^^^^^

Making sub a parameter didn’t lead to improvements, but made things worse in my experiments. The value of sub seems to be a very sensitive one.
This implementation of lsuv doesn’t check whether the variance is tiny (no eps) or undefined (small bs w/ no variance) before dividing by it. It tests with bs=512, which won’t have any of these issues, but that’s far from a general case.
Using bs=2 requires a much, much lower lr.
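A minimal sketch of the kind of guard I mean (a hypothetical helper, not code from the notebook): before dividing the weights by the measured std, fall back to a scale of 1.0 when the std is undefined or below an eps:

```python
import math

def safe_scale(std, eps=1e-6):
    """Return the factor to divide the weights by.

    Falls back to 1.0 (i.e. leave the weights untouched) when the
    measured std is NaN (e.g. a batch with no variance) or tiny,
    instead of blowing the weights up by dividing by ~0.
    """
    if math.isnan(std) or abs(std) < eps:
        return 1.0
    return std
```

so the division in lsuv_module would become something like `m.weight.data /= safe_scale(h.std)`.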

While experimenting I used a reproducible random seed, so it was helpful to analyse more closely the cases where the network wasn’t training (so that I could turn different parts on/off). Most of the time lsuv seemed to be the culprit. So it is helpful in general, but also leads to nans at times at the lr used in the nb.
Also note that the original LSUV doesn’t tweak the mean, only the std. But w/o the mean tweak in the lesson nb, things don’t perform as well. So this is a bonus. And the nb version doesn’t implement the optional orthonormal init.
Hi,
According to 10_augmentation.ipynb, using numpy is supposed to result in faster tensor creation, but that’s clearly not happening for me. Not sure what could be going on; I repeated this test with 2 more images:
That’s interesting. I guess it might depend on the Pillow, pytorch, and numpy versions…
Just a quick, really pedantic, domain-specific comment on RandomResizeCrop, which from the lesson it sounds like fastai is looking to replace with a very smart perspective “warp” transform.
Jeremy made comments that objects are never wider or thinner in the real world and that RandomResizeCrop is perhaps trying to account for changes in perspective.
Knowing the behavior of camera lenses, RandomResizeCrop still serves a purpose. For example, some lenses will make faces in portraits (i.e. objects) appear wider, while others will make them appear thinner.
See this site as the first google search reference on this: https://mcpactions.com/2014/05/19/perfectportraitlens/
In short, I think a combination of both could be useful.
From my understanding of lens distortion that’s not quite true: they don’t squish an entire picture just horizontally or vertically. Or at least not enough to make the massive squishing of imagenet transforms sensible.
In the code of this lesson 11, is there any dropout being applied by pytorch underneath during training? Or does it have to be explicitly called for? Thank you.
Here is something I’m confused about:
At around 01:22:00 in the lesson, talking about momentum, Jeremy says that if you have 10 million activations, you need to store 10 million floats. Did he mean weights instead of activations? Because a bit before that he called them parameters (which, as I understand, are the same as weights), and a bit before that he also called them activations (which are very different).
Thanks.
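For what it’s worth, momentum keeps one extra running value per parameter (weight), not per activation; a toy sketch of SGD with momentum (made-up numbers, not the lesson’s optimizer code) showing that the only extra storage is `avg`, shaped like the parameters:

```python
import torch

p = torch.randn(5)             # 5 parameters (weights)
avg = torch.zeros_like(p)      # the 5 extra floats momentum must store
beta, lr = 0.9, 0.1

for _ in range(3):             # a few fake training steps
    grad = torch.ones_like(p)  # pretend gradient
    avg = beta * avg + grad    # running (non-debiased) momentum average
    p = p - lr * avg           # parameter update uses the average
```

So with 10 million parameters, momentum needs 10 million extra floats for `avg`.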
Probably I have missed something, but I do not understand why we choose kernel size 3 instead of 5 for the first layer. If that is something proven by research, can you explain why? Also, why has Jeremy chosen the number of channels for the first layer to be c_in x 3 x 3? I did not understand that part either.
Actually I think it is in this other “bag of tricks” paper (somewhat inflationary use of the term ;)):
Page 5:
The observation is that the computational cost of a convolution is quadratic to the kernel width or height. A 7 × 7 convolution is 5.4 times more expensive than a 3 × 3 convolution. So this tweak replacing the 7 × 7 convolution in the input stem with three conservative 3 × 3 convolutions […]
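The “5.4 times” figure falls straight out of the kernel area; a quick back-of-the-envelope check (my own arithmetic, not from the paper):

```python
# a convolution's cost per output element scales with the kernel area
cost_7x7 = 7 * 7                     # 49 multiply-accumulates
cost_3x3 = 3 * 3                     # 9 multiply-accumulates

ratio_single = cost_7x7 / cost_3x3       # 49/9 ~ 5.4, the paper's figure
ratio_stack = cost_7x7 / (3 * cost_3x3)  # vs three stacked 3x3 convs: ~1.8x
print(ratio_single, ratio_stack)
```

i.e. even replacing one 7×7 with a stack of three 3×3 convs is still cheaper, while keeping a similar receptive field.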
Oops, I copied from Google without checking, so I deleted my reply so others won’t be confused.
Dear @jeremy, thank you sooo much for that moment:
" we show that L2 reg has no regularisation effect…WHAT??? … "
" you know how I keep mentioning how none of us know what they are doing… that should make you feel better about ‘can I contribute to DL’ …? "
It does. I love your way of teaching, encouraging us, knocking down the barriers of entry to deep learning and gifting us all these tips and tools, but THAT WAS BY FAR THE BEST, MOST ENCOURAGING MOMENT so far for me. I was just thinking “my head is about to explode with all this info, I need a walk, fresh air” and boom you dropped the mic.
Thanks again and please keep it coming,
Lamine Gaye
PS: my 1st post on the forums… I just had to say this
At around 28:00, Jeremy talks about get_files() and how fast it is. I was intrigued, and decided to try to recreate it locally and experiment. I wanted to start off with a naive version and see what sort of speed-ups I could achieve:
def get_fnames(train=False):
    path = Path('/Users/daniel/.fastai/data/imagenette160')
    if train:
        path = path/'train'
    else:
        path = path/'val'
    fnames = []
    for _dir in path.ls():
        for fname in _dir.ls():
            fnames.append(fname)
    return fnames
>>> fnames = get_fnames(train=True)
>>> len(fnames)
12894
>>> %timeit -n 10 get_fnames(train=True)
34.9 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> fnames[:3]
[PosixPath('/Users/daniel/.fastai/data/imagenette160/train/n03394916/n03394916_58454.JPEG'),
PosixPath('/Users/daniel/.fastai/data/imagenette160/train/n03394916/n03394916_32588.JPEG'),
PosixPath('/Users/daniel/.fastai/data/imagenette160/train/n03394916/n03394916_32422.JPEG')]
In the video, he shows a ~70ms runtime, whereas I’m getting ~35ms, about twice as fast. This suggests I’ve misunderstood the task at hand or something… can anyone shed any light on what’s going on? Are we in fact doing the same thing? Why does my timing show a faster speed?
Don’t see this mentioned here in the Lesson 11 thread, so adding a link here.
If you get the following errors when running 08_data_block.ipynb:

cos_1cycle_anneal not defined
Runner does not have in_train

Link to the forum thread discussing this and the solution: thanks to @exynos7 for spending the time to solve it for all of us.
A minor tweak that fixes the LSUV algorithm normalization
Note added: I found after writing this post that @stas independently discovered this a while ago
At the end of the 07a_lsuv.ipynb notebook, the means and stds of each layer are shown after the application of the LSUV algorithm, and we see that the means are not near zero.
There is a comment in the notebook: "Note that the mean doesn’t exactly stay at 0. since we change the standard deviation after by scaling the weight."
However, if in the lsuv_module you first scale the standard deviation and then correct the mean, the problem with the not-near-zero means is solved. This involves switching the order of two lines of code as follows:
Original version:
while mdl(xb) is not None and abs(h.mean)  > 1e-3: m.bias -= h.mean
while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
Modified version:
while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std
while mdl(xb) is not None and abs(h.mean)  > 1e-3: m.bias -= h.mean
In the next cell, we execute the lsuv initialization on all layers and print the means and standard deviations of the weights:
for m in mods: print(lsuv_module(m, xb))
Here is the output with the modified code:
(2.3492136236313854e-08, 0.9999998807907104)
(2.5848951867857295e-09, 1.0)
(1.7811544239521027e-08, 0.9999998807907104)
(9.778887033462524e-09, 0.9999999403953552)
(1.30385160446167e-08, 1.0000001192092896)
The normalization is now perfect (i.e., within the specified precision) in both mean and standard deviation. However, fixing the mean to be near zero did not improve model accuracy.
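A toy numeric sketch of why the order matters (a single scalar “layer” I made up, not the notebook’s code): shifting the bias doesn’t change the std, but rescaling the weight does shift the mean, so the std fix has to come first.

```python
import torch

torch.manual_seed(0)
x = torch.randn(10_000) + 2.0          # inputs with a nonzero mean
w, b = torch.tensor(3.0), torch.tensor(5.0)
h = w * x + b                          # mean ~ 11, std ~ 3

# mean first, then std (the original order):
b1 = b - h.mean()                      # center: mean ~ 0
w1 = w / (w * x + b1).std()            # rescale: std ~ 1 ...
h_bad = w1 * x + b1                    # ... but the mean has drifted off 0 again

# std first, then mean (the modified order):
w2 = w / h.std()                       # rescale: std ~ 1
b2 = b - (w2 * x + b).mean()           # center: mean ~ 0, std untouched
h_good = w2 * x + b2

print(h_bad.mean().item(), h_good.mean().item(), h_good.std().item())
```

The bias shift at the end is why the modified order lands on mean ≈ 0 and std ≈ 1 simultaneously, matching the printed stats above.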