I can’t figure out the purpose of this code in `RunningBatchNorm.forward` (07_batchnorm.ipynb):
```python
if self.step<100:
    sums = sums / self.dbias
    sqrs = sqrs / self.dbias
    c    = c    / self.dbias
means = sums/c
vars  = (sqrs/c).sub_(means*means)
```
`sums/c` and `sqrs/c` cancel out the first four lines in the snippet above (dividing `sums`, `sqrs` and `c` by the same `self.dbias` leaves their ratios unchanged), and the debiased values aren’t used again.
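A quick numeric sanity check of that cancellation (made-up values, not from the notebook):

```python
import torch

sums, sqrs, c = torch.tensor(12.), torch.tensor(30.), torch.tensor(6.)
dbias = torch.tensor(0.75)

means1, vars1 = sums/c, sqrs/c - (sums/c)**2
s, q, cc = sums/dbias, sqrs/dbias, c/dbias  # debias all three by the same factor
means2, vars2 = s/cc, q/cc - (s/cc)**2

# the common factor cancels in both ratios
assert torch.allclose(means1, means2) and torch.allclose(vars1, vars2)
```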
which also renders `self.dbias` useless, other than as the tensor that `mom1` is modeled after:

```python
self.mom1 = self.dbias.new_tensor(mom1)
```

But why do we need `self.mom1`? It’s a temporary calculation.
And `self.step` is no longer needed either.
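As far as I can tell, `new_tensor` there only serves to put `mom1` on the buffer’s device/dtype. A minimal sketch of what I understand it to be equivalent to (assuming a float buffer):

```python
import torch

buf  = torch.zeros(1)   # stands in for self.dbias
mom1 = 0.9

a = buf.new_tensor(mom1)                                    # what the notebook does
b = torch.tensor(mom1, dtype=buf.dtype, device=buf.device)  # same result
assert a == b and a.dtype == b.dtype and a.device == b.device
```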
Here is a cleaned-up version:

```python
import math
import torch
import torch.nn as nn
from torch import tensor

class RunningBatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom,self.eps = mom,eps
        self.mults = nn.Parameter(torch.ones (nf,1,1))
        self.adds  = nn.Parameter(torch.zeros(nf,1,1))
        self.register_buffer('sums',  torch.zeros(1,nf,1,1))
        self.register_buffer('sqrs',  torch.zeros(1,nf,1,1))
        self.register_buffer('batch', tensor(0.))
        self.register_buffer('count', tensor(0.))

    def update_stats(self, x):
        bs,nc,*_ = x.shape
        # detach the buffers so running stats don't accumulate autograd history
        self.sums.detach_()
        self.sqrs.detach_()
        dims = (0,2,3)
        s  = x    .sum(dims, keepdim=True)
        ss = (x*x).sum(dims, keepdim=True)
        c  = self.count.new_tensor(x.numel()/nc)
        # batch-size-dependent momentum: larger batches move the stats faster
        mom1 = 1 - (1-self.mom)/math.sqrt(bs-1)
        self.sums .lerp_(s,  mom1)
        self.sqrs .lerp_(ss, mom1)
        self.count.lerp_(c,  mom1)
        self.batch += bs

    def forward(self, x):
        if self.training: self.update_stats(x)
        means = self.sums/self.count
        vars  = (self.sqrs/self.count).sub_(means*means)
        # variance estimates are unreliable for the first few batches
        if bool(self.batch < 20): vars.clamp_min_(0.01)
        x = (x-means).div_((vars.add_(self.eps)).sqrt())
        return x.mul_(self.mults).add_(self.adds)
```
I’m getting exactly the same accuracy results with this version, but that doesn’t mean it holds in the generic case.
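For reference, the kind of quick check I mean is along these lines (hypothetical shapes, just a smoke test, not the notebook’s training loop):

```python
import torch

torch.manual_seed(0)
rbn = RunningBatchNorm(8)

rbn.train()                          # stats get updated in training mode
y = rbn(torch.randn(4, 8, 14, 14))
print(y.shape, float(rbn.batch))     # torch.Size([4, 8, 14, 14]) 4.0

rbn.eval()                           # eval mode uses the running stats only
y = rbn(torch.randn(2, 8, 14, 14))
print(y.mean().item(), y.std().item())
```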
Unless the intention was to save the temp results in `forward`, and then this will be needed:

```python
if self.step<100:
    sums.div_(self.dbias)
    sqrs.div_(self.dbias)
    c   .div_(self.dbias)
means = sums/c
vars  = (sqrs/c).sub_(means*means)
```
Since after doing:

```python
sums = self.sums
sums = sums / self.dbias
```

`sums` is no longer an alias to `self.sums`.
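One way to see exactly when the alias breaks is to compare storage pointers; a tiny sketch:

```python
import torch

buf  = torch.ones(3)   # stands in for self.sums
sums = buf             # plain assignment: still an alias
print(sums.data_ptr() == buf.data_ptr())        # True

sums = sums / 2        # rebinds sums to a new tensor; buf is untouched
print(sums.data_ptr() == buf.data_ptr(), buf)   # False tensor([1., 1., 1.])

sums = buf
sums.div_(2)           # in-place op: modifies buf through the alias
print(sums.data_ptr() == buf.data_ptr(), buf)   # True tensor([0.5000, 0.5000, 0.5000])
```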
Just had to clarify for myself when `a = b` stops being an alias in pytorch:
```python
import torch

def dump(a, b, note): print(f"{note}\na={a}\nb={b}")

a = torch.ones(5)
b = a                  # plain assignment: b aliases a
dump(a, b, "init")

b = b + 1              # rebinds b to a new tensor; a is untouched
dump(a, b, "+ new var")

b = a
b += 1                 # in-place add: modifies the shared tensor
dump(a, b, "self referring +")

b = a
b.add_(1)              # in-place method: also modifies the shared tensor
dump(a, b, "add_")
```
gives:

```
init
a=tensor([1., 1., 1., 1., 1.])
b=tensor([1., 1., 1., 1., 1.])
+ new var
a=tensor([1., 1., 1., 1., 1.])
b=tensor([2., 2., 2., 2., 2.])
self referring +
a=tensor([2., 2., 2., 2., 2.])
b=tensor([2., 2., 2., 2., 2.])
add_
a=tensor([3., 3., 3., 3., 3.])
b=tensor([3., 3., 3., 3., 3.])
```
So `b = b+1` is not affecting `a`.