Lesson 10 Discussion & Wiki (2019)

(base) serge@gpu:~/annotations/fastai_docs$ git config --get remote.origin.url | sed 's|^.*//||; s/.*@//; s/[^:/]\+[:/]//; s/.git$//'
fastai/fastai_docs

and branch:

git branch | sed -n '/\* /s///p'

That’s why this lesson needs pytorch-nightly - see the first post of this topic for details.

pytorch 1.0.x’s torch.var doesn’t accept a tuple of dims, only a single int:
https://pytorch.org/docs/stable/torch.html#torch.var

pytorch 1.1.x’s (aka pytorch-nightly at the moment) does:
https://pytorch.org/docs/master/torch.html#torch.var
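
If you’re stuck on 1.0.x, a minimal workaround sketch (the tensor shape here is just for illustration) is to build the variance from sums over the desired dims, which sum() does accept as a tuple:

import torch

x = torch.randn(64, 3, 28, 28)   # illustrative: bs x channels x h x w
dims = (0, 2, 3)
n = x.numel() / x.shape[1]       # elements reduced per channel

# biased variance over several dims, without calling torch.var with a tuple
m = x.sum(dims, keepdim=True) / n
v = (x * x).sum(dims, keepdim=True) / n - m * m
# on pytorch-nightly this matches x.var(dims, keepdim=True, unbiased=False)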

Oh, I see Jeremy already explained it here:

4 Likes

Thanks for the quick response.

1 Like

I can’t figure out the purpose of this code in RunningBatchNorm.forward (07_batchnorm.ipynb)

        if self.step<100:
            sums = sums / self.dbias
            sqrs = sqrs / self.dbias
            c    = c    / self.dbias
        means = sums/c
        vars = (sqrs/c).sub_(means*means)

The divisions by self.dbias cancel out in sums/c and sqrs/c (their relative proportion doesn’t change), and those debiased values aren’t used anywhere else.

That also renders self.dbias useless, other than as something to model mom1 after:
self.mom1 = self.dbias.new_tensor(mom1)
But why do we need self.mom1 at all? It’s a temporary calculation.
And self.step is no longer needed either.
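
A quick toy check of that cancellation (made-up scalar values, just to show that dividing both the sums and the count by dbias leaves the ratio unchanged):

import torch

sums, c, dbias = torch.tensor(10.), torch.tensor(4.), torch.tensor(0.7)
means_debiased = (sums / dbias) / (c / dbias)
means_plain    = sums / c
print(torch.allclose(means_debiased, means_plain))  # True: the dbias factors cancel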

Here is a cleaned up version:

import math
import torch
import torch.nn as nn
from torch import tensor

class RunningBatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom,self.eps = mom,eps
        self.mults = nn.Parameter(torch.ones (nf,1,1))
        self.adds = nn.Parameter(torch.zeros(nf,1,1))
        self.register_buffer('sums', torch.zeros(1,nf,1,1))
        self.register_buffer('sqrs', torch.zeros(1,nf,1,1))
        self.register_buffer('batch', tensor(0.))
        self.register_buffer('count', tensor(0.))

    def update_stats(self, x):
        bs,nc,*_ = x.shape
        self.sums.detach_()
        self.sqrs.detach_()
        dims = (0,2,3)
        s = x.sum(dims, keepdim=True)
        ss = (x*x).sum(dims, keepdim=True)
        c = self.count.new_tensor(x.numel()/nc)
        mom1 = 1 - (1-self.mom)/math.sqrt(bs-1)
        self.sums.lerp_(s, mom1)
        self.sqrs.lerp_(ss, mom1)
        self.count.lerp_(c, mom1)
        self.batch += bs

    def forward(self, x):
        if self.training: self.update_stats(x)
        means = self.sums/self.count
        vars = (self.sqrs/self.count).sub_(means*means)
        if bool(self.batch < 20): vars.clamp_min_(0.01)
        x = (x-means).div_((vars.add_(self.eps)).sqrt())
        return x.mul_(self.mults).add_(self.adds)
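
A quick sanity check that the module runs in both modes (a minimal usage sketch with made-up sizes, not from the notebook):

rbn = RunningBatchNorm(8)              # 8 channels, chosen arbitrarily
xb = torch.randn(32, 8, 14, 14)        # bs x nf x h x w
rbn.train(); _ = rbn(xb)               # training mode: updates the running stats
rbn.eval();  out = rbn(xb)             # eval mode: normalizes with the stored stats
print(out.shape)                       # torch.Size([32, 8, 14, 14])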

I’m getting exactly the same accuracy results with this version, but that doesn’t mean it will hold in the general case.

Unless the intention was to save the debiased temporaries in forward - in that case this would be needed instead:

        if self.step<100:
            sums.div_(self.dbias)
            sqrs.div_(self.dbias)
            c   .div_(self.dbias)
        means = sums/c
        vars = (sqrs/c).sub_(means*means)

Since after doing:

sums = self.sums
sums = sums / self.dbias

sums is no longer an alias of self.sums.

Just had to clarify for myself when a = b stops being an alias in PyTorch.

import torch

def dump(a, b, note): print(f"{note}\na={a}\nb={b}")
a = torch.ones(5)
b = a
dump(a, b, "init")

b = b + 1
dump(a, b, "+ new var")

b = a
b += 1
dump(a, b, "self referring +")

b = a
b.add_(1)
dump(a, b, "add_")

gives:

init
a=tensor([1., 1., 1., 1., 1.])
b=tensor([1., 1., 1., 1., 1.])
+ new var
a=tensor([1., 1., 1., 1., 1.])
b=tensor([2., 2., 2., 2., 2.])
self referring +
a=tensor([2., 2., 2., 2., 2.])
b=tensor([2., 2., 2., 2., 2.])
add_
a=tensor([3., 3., 3., 3., 3.])
b=tensor([3., 3., 3., 3., 3.])

So b = b + 1 does not affect a, whereas the in-place += and add_(1) do.
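
A quick way to check aliasing directly is to compare the storage pointers (small sketch):

import torch

a = torch.ones(5)
b = a
print(a.data_ptr() == b.data_ptr())   # True: b aliases the same storage
b = b + 1
print(a.data_ptr() == b.data_ptr())   # False: b now points at a new tensor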

1 Like

We can think of batch size as a parameter that influences the search for a minimum of the loss function during gradient descent. Since an optimization step is usually taken at the end of each batch, a smaller batch size means more optimization steps per epoch. With a batch size equal to the number of samples in the training set there would be only one optimization step per epoch, because there would be only one batch - but that step would take all of the available training data, and therefore all of the information it contains, into account.

However, considering all of the information at once might not be a good idea: in the multidimensional loss surface there can be crevices and folds leading down to deeper valleys, and these would be missed with larger batch sizes. Reducing the batch size allows gradient descent to explore those tight spaces. On the other hand, a smaller batch reduces the amount of information considered in each step, so some steps will head in the wrong direction. Balancing the precision of each gradient descent step against the ability to follow the curvature of the search space is what we adjust when tuning the batch size hyperparameter.
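
As a concrete illustration of the “more steps per epoch with smaller batches” point (the dataset size here is made up):

import math

n_train = 50_000                      # hypothetical training-set size
for bs in (50_000, 512, 64):
    steps = math.ceil(n_train / bs)   # one optimizer step per mini-batch
    print(f"bs={bs:>6} -> {steps:>4} steps per epoch")
# bs= 50000 ->    1 steps per epoch
# bs=   512 ->   98 steps per epoch
# bs=    64 ->  782 steps per epoch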

3 Likes

Good point - you’re absolutely right. Both the count and the running stats are biased. But they’re biased in the same way, so you can use them directly and the two sets of bias end up canceling out! :slight_smile:

I’ve added an additional section to the notebook pointing this out now.

4 Likes

Please see my additional notes here - perhaps you did mean to save those debiased calculations?

Also, in the newly added section, step is not needed (in 2 places: __init__ plus update_stats).

Oh, and one more tweak: you placed the modified version after another snippet that changed bs to 32, so it’s no longer testing the same thing. You probably need to add:

data = DataBunch(*get_dls(train_ds, valid_ds, 2), c)

to compare apples to apples.

Thanks.

In the course, neither Layer Norm nor Instance Norm could train the network properly. I guess this may be because an appropriate learning rate wasn’t used, so I ran the following experiments with different learning rates.
1. Layer norm

learn,run = get_learn_run(nfs, data, 0.8, conv_ln, cbs=cbfs)
%time run.fit(1, learn)
-----------------------------------
train: [nan, tensor(0.1259, device='cuda:0')]
valid: [nan, tensor(0.0991, device='cuda:0')]

The result is very bad - maybe the learning rate is too large (0.8).
So we decrease the lr (0.8 -> 0.1), and the network trains normally.

train: [0.581599375, tensor(0.8228, device='cuda:0')]
valid: [0.18959957275390624, tensor(0.9433, device='cuda:0')]

Moreover, if we use a one-cycle learning rate schedule within a single epoch, the result is very good - up to 0.97!

sched = combine_scheds([0.3,0.7], [sched_lin(5e-2,0.8), sched_lin(0.8,1e-2)])
--------------------------------
train: [0.5033986328125, tensor(0.8385, device='cuda:0')]
valid: [0.09999825439453125, tensor(0.9699, device='cuda:0')]

The results show that Layer Norm can train the network to a good result, provided a small learning rate is used.

2. Instance norm
A learning rate of 0.1 is used for Instance Norm in the course. But whether I use a smaller lr (1e-2), a larger lr (0.9), or a one-cycle fit, the network can’t train properly.

------------------------------------------------
train: [nan, tensor(0.0986, device='cuda:0')]
valid: [nan, tensor(0.0991, device='cuda:0')]
1 Like

Right. Layer norm is OK, but less stable than batchnorm (at larger batch sizes), so you have to train at lower learning rates (e.g. if you use lr warmup then batchnorm can go to an even higher lr). More importantly, it has problems at inference time, as discussed.
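
For example, a warmup-then-decay schedule in the style of the course’s annealing helpers (assuming the combine_scheds/sched_cos functions from the earlier notebooks; the lr values are made up):

# warm up over the first 30% of training, then anneal back down
sched = combine_scheds([0.3, 0.7], [sched_cos(0.1, 1.0), sched_cos(1.0, 0.1)])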

2 Likes

In notebook “07_batchnorm”, in LayerNorm, shouldn’t it be var instead of std?

def forward(self, x):
    m = x.mean((1,2,3), keepdim=True)
    v = x.std ((1,2,3), keepdim=True)
    x = (x-m) / ((v+self.eps).sqrt())
    return x*self.mult + self.add

Also, I see that torch.var() does not take a tuple as dim - how can I get around this error?

2 Likes

I guess you are right about using the variance instead of the std (in the paper they have std^2).
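
A sketch of the corrected forward (assuming pytorch-nightly, where var accepts a tuple of dims):

def forward(self, x):
    m = x.mean((1,2,3), keepdim=True)
    v = x.var ((1,2,3), keepdim=True)
    x = (x-m) / ((v+self.eps).sqrt())
    return x*self.mult + self.add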

For your torch.var() tuple error see:

1 Like

Well spotted! Fixed now. (It didn’t change the outcome of the training FYI.)

1 Like

No, there’s no need - since the numerator and denominator are always debiased in the same way, they always cancel out.

Fixed - thanks.

That’s intentional - it can be compared to the “What can we do in a single epoch?” section.

1 Like

Pretend you’re doing multi-category classification: if you feed MultiCategory a one-element array as your label, you’ll probably be fine :wink:

https://docs.fast.ai/core.html#MultiCategory

Those cyclic spikes in the activation plots…

I think they are the validation set activations that are getting appended to the training activations. data.train_dl has length 98 minibatches. The spikes occur every 108 minibatches.

Was this already obvious to everyone but me?

3 Likes

Nope - not obvious enough to stop me from making the mistake of leaving the validation set in the pic in the code!
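
One way to keep the validation activations out of the plots is to record stats only while the module is in training mode - a minimal sketch using a plain PyTorch forward hook (not the notebook’s Hooks class):

import torch.nn as nn

stats = ([], [])                          # (means, stds) collected across training batches

def append_stats(mod, inp, outp):
    if not mod.training: return           # skip validation/inference batches
    means, stds = stats
    means.append(outp.data.mean().item())
    stds.append(outp.data.std().item())

layer = nn.Conv2d(1, 8, 3)                # example layer; register on the layers you care about
handle = layer.register_forward_hook(append_stats)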

1 Like

Mystery solved then. For a while I was looking for some weird chaotic dynamic.

You may want to check out this pull request and this thread :slight_smile:

And that’s exactly what __constants__ does (it tells the JIT that these are fixed numbers rather than values that can vary). I guess that’s part of:
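
As a small aside, a minimal sketch of __constants__ with the ScriptModule API of this PyTorch generation (hypothetical module, just to illustrate the idea):

import torch

class Scale(torch.jit.ScriptModule):
    __constants__ = ['factor']            # tell the JIT this attribute is a fixed value

    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    @torch.jit.script_method
    def forward(self, x):
        return x * self.factor            # factor is baked into the compiled graph

m = Scale(2.0)
print(m(torch.ones(3)))                   # tensor([2., 2., 2.])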

Best regards

Thomas

1 Like