Lesson 10 Discussion & Wiki (2019)

This is something I’m a little suspicious of myself. I have some experience with systems where everything was turned into a “plug-in” of some kind (similar to how callbacks are used here), and while it may seem clean in the beginning, eventually you’ll run into the problem that now these callbacks need a way to work together and that this work must be orchestrated somehow. And because all this logic is spread out across many different classes, it can be hard to understand exactly how they interact. Simplifying the training loop itself and moving all the complexity elsewhere doesn’t mean the complexity is gone – and may actually introduce additional complexity because stuff isn’t in one place anymore. I’m not sure if this will be a problem here, but it requires a careful design.

1 Like

It’s been tricky to get right, but after implementing dozens of callbacks we’ve found things are working out great. Complex functionality that would normally require big projects, like GANs and mixed precision training, is just a few lines of code in fastai.

2 Likes

Hi,
I am working on video link annotations for the Lesson 10 notebooks. I noticed that the current version of 05a_foundations.ipynb does not match the version presented in the video. Some parts are missing - in particular, the cells related to the discussion about partials and the entire “Callbacks as callable classes” section. Would it be possible to commit the version that Jeremy Howard used for the lecture to the repository, so that the annotated notebook matches the video content?

What’s __constants__ = ['eps'] in LayerNorm in 07_batchnorm.ipynb?

I can’t find such a feature in Python itself - I only found some references to it in PyTorch JIT discussions.
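From what I can tell, __constants__ isn’t a Python feature at all but a TorchScript convention: attribute names listed there are treated as compile-time constants when the module is compiled with torch.jit.script. A minimal sketch (my own toy module, not from the notebook):

import torch
from torch import nn

class Toy(nn.Module):
    __constants__ = ['eps']  # tells TorchScript to treat eps as a constant

    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, x):
        return x + self.eps  # eps is baked into the scripted graph

scripted = torch.jit.script(Toy())
print(scripted(torch.randn(3)))

In eager mode it has no effect, which is presumably why it doesn’t show up outside the jit discussions.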

I noticed that the current version of 05a_foundations.ipynb does not match the version presented in the video.

Perhaps you have an outdated checkout and need to update it? git pull
It’s all there: https://github.com/fastai/fastai_docs/blob/master/dev_course/dl2/05a_foundations.ipynb

You can also check the history of modifications - I looked - none of them removed anything. https://github.com/fastai/fastai_docs/commits/master/dev_course/dl2/05a_foundations.ipynb

I will check again, but I did git pull right before I started working on annotation.

Do: git log --oneline

and check that you’re on origin master and not a forked branch. At the moment it should say:

5b26a49 (HEAD -> master, origin/master, origin/HEAD) Merge branch 'master' of github.com:fastai/fastai_docs

But most likely you’re not on upstream master. Chances are that you forked the fastai_docs repo before those files were updated, so when you run git pull you’re pulling from your forked master, which is out of sync with upstream master. https://docs.fast.ai/dev/git.html#how-to-keep-your-feature-branch-up-to-date

So those bits you’re missing were added after you made the fork, hence the impression that they were removed. You’re “back in the future” - they just hadn’t been added yet :wink:

A torch tensor is throwing up a silly error - is anyone else seeing this as well?

x.var((1,), keepdim=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: var(): argument 'dim' (position 1) must be int, not tuple

torch.__version__
'1.0.1.post2'

I just did a fresh clone and the "Callbacks as callable classes" section is there, as it should be.

But when I check the log in the folder where I did the git pull, I get exactly what I should get:

(base) serge@gpu:~/annotations/fastai_docs$ git log --oneline
5b26a49 (HEAD -> master, origin/master, origin/HEAD) Merge branch 'master' of github.com:fastai/fastai_docs

It is weird, isn’t it?

I can get the right version from the freshly cloned repo folder, so we can call this issue closed, I guess. Thank you for the help!

Hmm, perhaps your repo is somehow misconfigured - what do you get when you run this in that folder?

echo $(git config --get remote.origin.url | sed 's|^.*//||; s/.*@//; s/[^:/]\+[:/]//; s/.git$//')/$(git branch | sed -n '/\* /s///p')

If you’re on bash, I highly recommend https://docs.fast.ai/dev/git.html#bash-git-prompt which removes the guessing - it always tells you where you’re at, like cwd in your prompt, but git-wise.

(base) serge@gpu:~/annotations/fastai_docs$ git config --get remote.origin.url | sed 's|^.*//||; s/.*@//; s/[^:/]\+[:/]//; s/.git$//'
fastai/fastai_docs

and branch:

git branch | sed -n '/\* /s///p'

That’s why this lesson needs pytorch-nightly - see the first post of this topic for details.

pytorch 1.0.x’s torch.var doesn’t support a tuple, only int:
https://pytorch.org/docs/stable/torch.html#torch.var

pytorch 1.1.x’s (aka pytorch-nightly at the moment) does:
https://pytorch.org/docs/master/torch.html#torch.var
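If you can’t install the nightly for some reason, a possible workaround (just a sketch - the shapes here are arbitrary) is to pass a plain int when reducing over a single dim, or to flatten the dims you want to reduce into one:

import torch

x = torch.randn(64, 32, 28, 28)

# single dim: an int works on 1.0.x, only the tuple form is missing
v1 = x.var(1, keepdim=True)

# several dims (e.g. per-channel over (0,2,3)): flatten them into one dim first
v2 = x.transpose(0,1).contiguous().view(x.shape[1], -1).var(1).view(1,-1,1,1)

# on 1.1+ (pytorch-nightly) this is simply: x.var((0,2,3), keepdim=True)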

Oh, I see Jeremy already explained it here:

4 Likes

Thanks for the quick response.

1 Like

I can’t figure out the purpose of this code in RunningBatchNorm.forward (07_batchnorm.ipynb)

        if self.step<100:
            sums = sums / self.dbias
            sqrs = sqrs / self.dbias
            c    = c    / self.dbias
        means = sums/c
        vars = (sqrs/c).sub_(means*means)

sums/c and sqrs/c don’t change when sums, sqrs and c are all divided by self.dbias (their relative proportion stays the same), so the first 4 lines in the snippet above cancel out - and the debiased values aren’t used anywhere else.

This also renders self.dbias useless, other than to model mom1 after:
self.mom1 = self.dbias.new_tensor(mom1)
But why do we need self.mom1? It’s a temporary calculation. And self.step is no longer needed either.
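A tiny sanity check of that cancellation (numbers made up, not from the notebook):

import torch

sums, c, dbias = torch.tensor(12.), torch.tensor(4.), torch.tensor(0.7)
print(torch.allclose(sums/c, (sums/dbias)/(c/dbias)))  # True - the dbias factors cancel in the ratio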

Here is a cleaned up version:

class RunningBatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom,self.eps = mom,eps
        self.mults = nn.Parameter(torch.ones (nf,1,1))
        self.adds = nn.Parameter(torch.zeros(nf,1,1))
        # running per-channel sums and sums of squares, plus a smoothed
        # element count and a counter of samples seen so far
        self.register_buffer('sums', torch.zeros(1,nf,1,1))
        self.register_buffer('sqrs', torch.zeros(1,nf,1,1))
        self.register_buffer('batch', tensor(0.))
        self.register_buffer('count', tensor(0.))

    def update_stats(self, x):
        bs,nc,*_ = x.shape
        self.sums.detach_()
        self.sqrs.detach_()
        dims = (0,2,3)
        s = x.sum(dims, keepdim=True)
        ss = (x*x).sum(dims, keepdim=True)
        c = self.count.new_tensor(x.numel()/nc)
        # batch-size-dependent momentum, then update the running stats
        mom1 = 1 - (1-self.mom)/math.sqrt(bs-1)
        self.sums.lerp_(s, mom1)
        self.sqrs.lerp_(ss, mom1)
        self.count.lerp_(c, mom1)
        self.batch += bs

    def forward(self, x):
        if self.training: self.update_stats(x)
        means = self.sums/self.count
        vars = (self.sqrs/self.count).sub_(means*means)
        # clamp the variance while the stats are still based on very few samples
        if bool(self.batch < 20): vars.clamp_min_(0.01)
        x = (x-means).div_((vars.add_(self.eps)).sqrt())
        return x.mul_(self.mults).add_(self.adds)

I’m getting exactly the same accuracy results with this version, but that doesn’t mean it holds in the general case.
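As a quick standalone smoke test of the class above (sizes are arbitrary and not from the notebook; the class itself needs math, torch.nn and tensor in scope, as in the notebook):

import math, torch
from torch import nn, tensor

rbn = RunningBatchNorm(nf=8)
rbn.train()
for _ in range(3):
    out = rbn(torch.randn(32, 8, 14, 14))  # a few "training" batches update the running stats
rbn.eval()
out = rbn(torch.randn(16, 8, 14, 14))      # eval uses the running stats only
print(out.shape)  # torch.Size([16, 8, 14, 14])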

Unless the intention was to save the debiased temp results in forward - then this would be needed instead:

        if self.step<100:
            sums.div_(self.dbias)
            sqrs.div_(self.dbias)
            c   .div_(self.dbias)
        means = sums/c
        vars = (sqrs/c).sub_(means*means)

Since after doing:

sums = self.sums
sums = sums / self.dbias

sums is no longer an alias to self.sums

I just had to clarify for myself when a = b stops being an alias in PyTorch.

import torch

def dump(a, b, note): print(f"{note}\na={a}\nb={b}")

a = torch.ones(5)
b = a
dump(a, b, "init")

b = b + 1            # creates a new tensor; b stops being an alias of a
dump(a, b, "+ new var")

b = a
b += 1               # in-place: modifies the tensor both names point to
dump(a, b, "self referring +")

b = a
b.add_(1)            # in-place method: also modifies the shared tensor
dump(a, b, "add_")

gives:

init
a=tensor([1., 1., 1., 1., 1.])
b=tensor([1., 1., 1., 1., 1.])
+ new var
a=tensor([1., 1., 1., 1., 1.])
b=tensor([2., 2., 2., 2., 2.])
self referring +
a=tensor([2., 2., 2., 2., 2.])
b=tensor([2., 2., 2., 2., 2.])
add_
a=tensor([3., 3., 3., 3., 3.])
b=tensor([3., 3., 3., 3., 3.])

So b = b + 1 does not affect a, while the in-place += and add_ do.

1 Like

We can think of batch size as a parameter that influences the search for the minimum of the loss function during gradient descent optimization. Since there is usually an optimization step at the end of each batch, a smaller batch size means more optimization steps per epoch. With a batch size equal to the number of samples in the training set, there would be only one optimization step per epoch, because there would be only one batch. That step would, however, be made considering all of the available training data, i.e. all of the information contained in the training set. Considering all of that information at once might not be a good idea: in the multidimensional loss surface there may be crevices and folds leading to deeper valleys that are missed with larger batch sizes. Reducing the batch size allows gradient descent to explore these tight spaces. On the other hand, a smaller batch size reduces the amount of information considered during each step, which sometimes leads to steps in the wrong direction. Finding the right value for the batch size hyperparameter is about balancing the precision of each gradient descent step against the ability to follow the curvature of the search space.
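Just to make the “more optimization steps per epoch” point concrete (the numbers are arbitrary):

import math

n_train = 50_000                      # hypothetical training set size
for bs in (50_000, 512, 64, 8):
    print(f"bs={bs:>6}: {math.ceil(n_train/bs)} optimizer steps per epoch")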

3 Likes

Good point - you’re absolutely right. Both the count and the running stats are biased. But they’re biased in the same way, so you can use them directly and the two sets of bias end up canceling out! :slight_smile:

I’ve added an additional section to the notebook pointing this out now.

4 Likes

Please see my extra notes here - perhaps you did mean to save those debiased calculations?

Also, in the newly added section self.step is not needed (in 2 places: __init__ plus update_stats).

Oh, and one more tweak: you placed the modified version after other snippets that changed bs to 32, so it’s no longer testing the same thing. You probably need to add:

data = DataBunch(*get_dls(train_ds, valid_ds, 2), c)

to compare apples to apples.

Thanks.

In the course, neither Layer norm nor Instance norm can train the network properly. I guess this may be because an appropriate learning rate is not used, so I did the following experiments with different learning rates.
1. Layer norm

learn,run = get_learn_run(nfs, data, 0.8, conv_ln, cbs=cbfs)
%time run.fit(1, learn)
-----------------------------------
train: [nan, tensor(0.1259, device='cuda:0')]
valid: [nan, tensor(0.0991, device='cuda:0')]

The result is very bad - maybe the learning rate is too large (0.8).
So we decrease the lr (0.8 -> 0.1), and the network can train normally.

train: [0.581599375, tensor(0.8228, device='cuda:0')]
valid: [0.18959957275390624, tensor(0.9433, device='cuda:0')]

Moreover, if we use one-cycle learning rate training in a single epoch, the result is very good - up to 0.97!

sched = combine_scheds([0.3,0.7], [sched_lin(5e-2,0.8), sched_lin(0.8,1e-2)])
--------------------------------
train: [0.5033986328125, tensor(0.8385, device='cuda:0')]
valid: [0.09999825439453125, tensor(0.9699, device='cuda:0')]

The results show that with Layer norm the network can reach good training results using a small learning rate.

2. Instance norm
A learning rate of 0.1 is used for Instance norm in the course. But whether using a smaller lr (1e-2), a greater lr (0.9), or a one-cycle fit, the network can’t train properly.

------------------------------------------------
train: [nan, tensor(0.0986, device='cuda:0')]
valid: [nan, tensor(0.0991, device='cuda:0')]
1 Like

Right. Layer norm is OK, but less stable than batchnorm (at larger batch sizes), so you have to train at lower learning rates (e.g. if you use lr warmup then batchnorm can go to an even higher lr). More importantly, it has problems at inference time, as discussed.

2 Likes