I think they are the validation set activations that are getting appended to the training activations. data.train_dl has length 98 minibatches. The spikes occur every 108 minibatches.
If for a single epoch we first go through the training data (98 batches) and then the validation data (20 batches), then all we need to do is plot only the first 98 statistics of each epoch - i.e. remove the validation data from the plot, as you say.
To make sure we plot data only for the convolutional layers, we can make this number a variable.
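A minimal sketch of that idea (the variable names here are my assumptions, not the notebook's actual ones): if the recorded per-minibatch statistics interleave 98 training batches and 20 validation batches per epoch, we can slice out just the training part before plotting.

```python
# Assumed counts from the discussion above: 98 train + 20 valid batches/epoch.
n_train, n_valid = 98, 20
n_total = n_train + n_valid

def train_only(stats, n_train=n_train, n_total=n_total):
    """Keep only the first n_train entries out of every n_total (one epoch)."""
    return [s for i, s in enumerate(stats) if i % n_total < n_train]

# Stand-in for per-minibatch means recorded by a hook over 2 epochs:
means = list(range(2 * n_total))
print(len(train_only(means)))  # 196 = 2 * 98
```

With `n_train` as a variable, the same slicing works for any train/valid split.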
Around 1:11:00, Jeremy talks about using larger filter sizes (e.g. 5×5 or 7×7) for the first convolution layer. So for each convolution we have e.g. 7×7×3 = 147 inputs (3 channels) and 32 outputs (= # of channels).
If we used a 3×3 filter we would have 27 inputs and 32 outputs, which Jeremy says is “losing information”. Why would that be losing information when we have more outputs than inputs?
Is the general rule of thumb to have inputs be a multiple of outputs for conv layers?
So in the second conv layer we have 3×3×32 = 288 inputs and 64 outputs (assuming 64 channels), which means 3×3 works well if we double the # of channels. Maybe a 4 to 1 ratio is about right?
Can anyone provide some good intuition as to why we want to do this? Are we trying to insert the data into a funnel so that we can squeeze out the useful info out of it?
What about when we are decreasing the # of channels (e.g. on the “up” part of a Unet)? If we have a 3×3 convolution with 512 input channels and 256 output channels, then we have 3×3×512 = 4608 inputs and 256 outputs. So even more squeezing, with an 18 to 1 ratio.
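The ratios above all follow one formula, so here is a small helper to compute them (my own sketch, not from the lesson code): the number of receptive-field input activations per output activation of a k×k convolution.

```python
def conv_compression(k, c_in, c_out):
    """Receptive-field inputs per output activation of a k x k convolution."""
    return (k * k * c_in) / c_out

print(conv_compression(7, 3, 32))     # 147/32 ≈ 4.59 (ImageNet-style first layer)
print(conv_compression(3, 3, 32))     # 27/32 ≈ 0.84  (the "wasteful" expansion)
print(conv_compression(3, 32, 64))    # 288/64 = 4.5  (second conv layer above)
print(conv_compression(3, 512, 256))  # 4608/256 = 18 (Unet "up" path)
```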
I find that it’s an interesting way to analyze model architectures, and I wonder if it could provide a way to improve them!
You’re using the term inputs in your “measurements”, but they aren’t inputs - they are input activations (video 1:09:00).
I don’t have enough experience to try to explain this on my own, but after re-listening to the few minutes of video around 1:09-1:11, my understanding is that when Jeremy says “losing information” he really means “wasting information”. If you go from 9->8 with a linear transformation, that layer becomes almost useless, since it took 9 bits and turned them back into almost the same 8 bits; and if you go from 27->32, it’s wasteful since you took 27 bits and turned them into 32 bits (extrapolation).
So I think Jeremy is looking for some sort of sufficient compression and making the optimizer for the first conv layer work hard to pick kernel weights so that it’ll gather important info and discard unimportant info.
So he suggests that imagenet does 147->32 and the class nb does 25->8 - a compression of about 3-5 times. (Also note that this is over 2 to 3 dimensions, so the x3-5 ratio is over a flattened size.)
I don’t know if my intuition is a good one. But I think his main point was to keep the network design to be lean and mean.
On the other hand, those conv layers are immediately followed by a ReLU, in which case it’s no longer a wasteful 1:1 re-mapping of bits, since we have a non-linearity there. This point is not clear to me: w/o ReLU what I wrote above makes sense, since then we just reshuffle bits, but with ReLU it’s no longer a linear re-mapping or a wasteful extrapolation.
Yes exactly - later in the video there was a question asked about this, and in my answer I said that “losing information” was a poor way for me to express it.
Even with the relu, I don’t think that having more channels out than receptive field in makes sense. But I’m not 100% sure. It needs some experiments and more thinking!
Thanks @stas . I think it’s an interesting subject to discuss. Makes sense to say that if we have an extrapolation we are being wasteful. Not sure if it’s a waste of data, maybe more a waste of computation and a waste of a layer?
Regarding ReLU: since it applies to each individual output activation, I believe it wouldn’t change the mapping between input and output activations, but instead just zero out the negative outputs.
Given that a compression of 4-5 is a bit of a standard, I find it interesting to look at different types of layers and see what kind of compression they imply:
The up part of a Unet uses roughly 18x compression - is it better/worse than trying to keep it at 4x? Would need some experimenting…
Looking at the animated gif, we end up with compressions varying between 1 and 4, with an average around 1.8 (if I counted right).
If instead our upsampling consists of a 3×3 conv with 4 times more channels, followed by a PixelShuffle(2), we have a 3×3/4 = 2.25 compression (part-2-lesson-14-2018).
Maybe it’s worth dividing the output # of channels by 2 to get a bit more compression out of the upsampling layer, and fewer parameters in the process…
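To check that arithmetic, here is a quick sketch (my own helper, not lesson code): a k×k conv mapping c_in channels to c_in·scale² channels, followed by PixelShuffle(scale), and an optional divisor to model halving the conv's output channels as suggested.

```python
def upsample_compression(k, c_in, scale, channel_div=1):
    """Inputs per output for a k x k conv feeding PixelShuffle(scale).

    The conv maps c_in channels to c_in * scale**2 // channel_div channels;
    channel_div=2 models halving the conv's output channels.
    """
    c_out = c_in * scale ** 2 // channel_div
    return (k * k * c_in) / c_out

print(upsample_compression(3, 256, 2))     # 2.25 (= 9/4, independent of c_in)
print(upsample_compression(3, 256, 2, 2))  # 4.5 with halved output channels
```

So halving the channels does bring the ratio back near the ~4x seen in the down path.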
Yes, your definition is much more precise, @Seb, than “waste of data”, that I said. Thank you!
My feeling is that we are in the early days, where we still need to tell DL boxes what to do - i.e. we still use hand-designed algorithms - and it probably won’t be too long before architectures become self-designing, so such questions would be a thing of the past. But we aren’t quite there yet, so we do need to understand such situations well in order to do better.
> Regarding ReLU: since it applies to each individual output activation, I believe it wouldn’t change the mapping between input and output activations, but instead just zero out the negative outputs.
But that’s the whole point of ReLU - it changes the mapping between input and output, don’t forget the gradients! If it’s somewhat linear still after the forward pass, it’s very non-linear after the backward pass.
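To make that concrete, a minimal elementwise sketch (plain Python, no framework assumed): in the forward pass ReLU only zeroes negatives, but in the backward pass it gates the gradient entirely, passing it through only where the input was positive.

```python
def relu(xs):
    """Forward pass: zero out the negative activations."""
    return [max(0.0, x) for x in xs]

def relu_grad(xs):
    """Backward pass: the upstream gradient flows only where x was positive."""
    return [1.0 if x > 0 else 0.0 for x in xs]

acts = [-1.5, 0.0, 2.0, -0.3]
print(relu(acts))       # [0.0, 0.0, 2.0, 0.0]
print(relu_grad(acts))  # [0.0, 0.0, 1.0, 0.0]
```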