Lesson 10 Discussion & Wiki (2019)

No, there's no need - since the numerator and denominator are always being debiased in the same way, they always cancel out.

Fixed - thanks.

That's intentional - it can be compared to the "What can we do in a single epoch?" section.

1 Like

Pretend to be doing multi-category classification: if you feed a MultiCategory with a one-element array as your label, you'll probably be fine :wink:

https://docs.fast.ai/core.html#MultiCategory

Those cyclic spikes in the activation plots…

I think they are the validation set activations that are getting appended to the training activations. data.train_dl has length 98 minibatches. The spikes occur every 108 minibatches.

Was this already obvious to everyone but me?

3 Likes

Nope - not obvious enough to stop me from making the mistake in the code of not removing validation from the pic!

1 Like

Mystery solved then. For a while I was looking for some weird chaotic dynamic.

You may want to check out this pull request and this thread :slight_smile:.

And that's exactly what __constants__ does (tell the JIT that these are fixed numbers rather than varying values). I guess that's part of:
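
As an aside, here's a minimal sketch of how __constants__ is used in a scripted module - the Flatten module and its start_dim attribute are just my own illustration, not from the lesson notebooks:

    import torch
    import torch.nn as nn

    class Flatten(nn.Module):
        # __constants__ tells the TorchScript compiler to treat these attributes
        # as fixed values baked into the graph, not mutable module state.
        __constants__ = ['start_dim']

        def __init__(self, start_dim=1):
            super().__init__()
            self.start_dim = start_dim

        def forward(self, x):
            return torch.flatten(x, self.start_dim)

    scripted = torch.jit.script(Flatten())
    print(scripted(torch.randn(2, 3, 4)).shape)  # torch.Size([2, 12])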

Best regards

Thomas

1 Like

If, for a single epoch, we first go through the training data (98 batches) and then the validation data (20 batches), then all we need to do is plot only the first 98 statistics - as you say, remove validation from the pic.

To ensure we plot data only for the convolutional layers we can make this number a variable.

As shown in this notebook.
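
Roughly, the plotting might look like this - a sketch assuming the course-style hooks whose h.stats holds per-batch (means, stds) lists; data and hooks are the names used in the lesson notebook, and the exact attributes may differ:

    import matplotlib.pyplot as plt

    n_train = len(data.train_dl)      # number of training minibatches, e.g. 98

    fig, ax = plt.subplots()
    for h in hooks:                   # one hook per conv layer
        means, stds = h.stats
        ax.plot(means[:n_train])      # keep only the training-batch statistics
    ax.legend(range(len(hooks)));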

1 Like

Can someone explain the purpose of "_order=0" in "class Callback():" in more detail?

1 Like

_order represents the order in which callbacks are executed.

In the Runner class, the smaller the _order, the earlier the callback runs:

    def __call__(self, cb_name):
        res = False
        # callbacks fire in ascending _order: the smallest _order runs first
        for cb in sorted(self.cbs, key=lambda x: x._order): res = cb(cb_name) and res
        return res
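
A quick illustration of that sorting, with made-up callback names and a simplified stand-in for the course's Callback class:

    class Callback():                   # simplified stand-in for the course's Callback base class
        _order = 0

    class EarlyCallback(Callback): _order = 1
    class LateCallback(Callback):  _order = 10

    cbs = [LateCallback(), EarlyCallback(), Callback()]
    print([type(cb).__name__ for cb in sorted(cbs, key=lambda cb: cb._order)])
    # ['Callback', 'EarlyCallback', 'LateCallback']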
3 Likes

Around 1:11:00, Jeremy talks about using larger filter sizes (e.g. 5*5 or 7*7) for the first convolution layer. So, we have for each convolution e.g. 7*7*3 = 147 inputs (3 channels), and 32 outputs (= # of channels).
If we used a 3*3 filter we would have 27 inputs and 32 outputs, which Jeremy says is “losing information”. Why would that be losing information when we have more outputs than inputs?

Is the general rule of thumb to have inputs be a multiple of outputs for conv layers?

So in the second conv layer, we have 3*3*32 = 288 inputs and 64 outputs (assuming 64 channels), which means 3*3 is appropriate if we double the # of channels. Maybe a 4 to 1 ratio is about right?

Can anyone provide some good intuition as to why we want to do this? Are we trying to insert the data into a funnel so that we can squeeze out the useful info out of it?

What about when we are decreasing the # of channels (e.g. on the "up" part of a Unet)? If we have a convolution with 512 input channels and 256 output channels and a 3*3 kernel, then we have 3*3*512 = 4608 inputs and 256 outputs. So even more squeezing, with an 18 to 1 ratio.
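
To make the counting concrete, here is a tiny helper (my own sketch, not from the lesson) that computes the ratio of receptive-field inputs to output channels for the three cases above:

    def conv_ratio(in_ch, out_ch, k):
        inputs = in_ch * k * k          # receptive-field inputs per output activation
        return inputs, out_ch, round(inputs / out_ch, 1)

    for in_ch, out_ch, k in [(3, 32, 7), (32, 64, 3), (512, 256, 3)]:
        print(conv_ratio(in_ch, out_ch, k))
    # (147, 32, 4.6)    first ImageNet-style 7*7 conv
    # (288, 64, 4.5)    second 3*3 conv, doubling the channels
    # (4608, 256, 18.0) 3*3 conv on the "up" path of a Unet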

I find that it’s an interesting way to analyze model architectures, and I wonder if it could provide a way to improve them!

2 Likes

I am having a similar question, but perhaps a bit more generic. What is a good way of thinking about dimensions of layers in a network architecture?

  • When to use wider layers and when a narrower layer should be used?
  • How to decide on the depth of the network and shape of layers in it?
  • Unet uses a very peculiar shape and it works great for it. What was the thought process of the designers on Unet?
1 Like

You're using the term inputs in your "measurements", but they aren't inputs - they're input activations (video 1:09:00).

I don't have enough experience to try to explain this on my own, but after re-listening to the few minutes of video around 1:09-1:11, my understanding is that when Jeremy says "losing information" he really means "wasting information". If you go from 9->8 with a linear transformation, that layer becomes almost useless, since it took almost 8 bits (9) and turned them back into 8 bits; when you go from 27->32 it's wasteful since you took 27 bits and turned them into 32 bits (extrapolation).

So I think Jeremy is looking for some sort of sufficient compression and making the optimizer for the first conv layer work hard to pick kernel weights so that it’ll gather important info and discard unimportant info.

So he suggests imagenet does 147->32 and the class nb does 25->8 - so a compression of about 3-5 times. (also note that this is over 2 to 3 dimensions so the x3-5 is over a flattened size)

I don’t know if my intuition is a good one. But I think his main point was to keep the network design to be lean and mean.

On the other hand, those conv layers are immediately followed by ReLU, in which case it’s no longer a wasteful 1:1 re-mapping of bits, since we have non-linearity there. And this point is not clear to me. W/o ReLU what I wrote above makes sense, since then we just reshuffle bits, but with ReLU it’s no longer a linear re-mapping or a wasteful extrapolation.

4 Likes

Yes exactly - later in the video there was a question asked about this, and in my answer I said that “losing information” was a poor way for me to express it.

Even with the relu, I don’t think that having more channels out than receptive field in makes sense. But I’m not 100% sure. It needs some experiments and more thinking!

5 Likes

Great question - here’s a paper where they tried to answer that question:

5 Likes

Thanks @stas . I think it’s an interesting subject to discuss. Makes sense to say that if we have an extrapolation we are being wasteful. Not sure if it’s a waste of data, maybe more a waste of computation and a waste of a layer?

Regarding ReLU: since it applies to each individual output activation, I believe it wouldn’t change the mapping between input and output activations, but instead just zero out the negative outputs.

Given that a compression of 4-5 is a bit of a standard, I find it interesting to look at different types of layers and see what kind of compression they imply:

Looking at the animated gif, we end up with compressions varying between 1 and 4, with an average around 1.8 (if I counted right).

If instead our upsampling consists of a 3*3 conv with 4 times more channels, followed by a PixelShuffle(2), we have a 3*3/4 = 2.25 compression (part-2-lesson-14-2018)

Maybe it's worth dividing the output # of channels by 2 to get a bit more compression out of the upsampling layer, and fewer parameters in the process…
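
For reference, a minimal PyTorch sketch of the 3*3 conv + PixelShuffle(2) block mentioned above (the channel count is arbitrary, just to show the shapes):

    import torch
    import torch.nn as nn

    c = 64
    up = nn.Sequential(
        nn.Conv2d(c, 4 * c, kernel_size=3, padding=1),  # 3*3*64 = 576 inputs -> 256 outputs per position
        nn.PixelShuffle(2),                             # rearranges the 4*c channels into a 2x larger grid
    )
    x = torch.randn(1, c, 16, 16)
    print(up(x).shape)  # torch.Size([1, 64, 32, 32])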

2 Likes

Yes, your definition is much more precise, @Seb, than the "waste of data" I suggested. Thank you!

My feeling is that we are in the early days where we still need to tell DL boxes what to do - i.e. we still use algorithms, and it probably won't be too long before architectures will be self-designing, so such questions would be a thing of the past. But we aren't quite there yet, so we do need to understand such situations well so that we could do better.

Regarding ReLU: since it applies to each individual output activation, I believe it wouldn’t change the mapping between input and output activations, but instead just zero out the negative outputs.

But that’s the whole point of ReLU - it changes the mapping between input and output, don’t forget the gradients! If it’s somewhat linear still after the forward pass, it’s very non-linear after the backward pass.

1 Like

One way to look at it might be that having more outputs than inputs means diluting information.

If you think about this not for images and convolutions but in terms of linear layers, I think it is easier to understand why it isn’t that useful to have more outputs than inputs here: A linear layer of, say, 10d inputs and 100d outputs will map the 10d input space into a 10d subspace of the 100d output space. As such, you have an inefficient representation of a 10d space. (Of course, if you have a ReLU nonlinearity, you have +/- so you might reasonably expect to fill 20d with positive numbers, but 100d still is wasteful.)
But convolutions are precisely linear maps between patches (e.g. look at the torch.nn.Unfold example), i.e. you have (in-channels * w * h) inputs and out-channels outputs, so the dimension counting works as you (and Jeremy) propose.
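
To make that concrete, here's a small sketch (my own, following the torch.nn.Unfold docs example) that reproduces a conv with a single (out_ch, in_ch*k*k) matrix applied to every patch:

    import torch
    import torch.nn as nn

    in_ch, out_ch, k = 3, 32, 7
    conv = nn.Conv2d(in_ch, out_ch, k, bias=False)
    x = torch.randn(1, in_ch, 28, 28)

    patches = nn.Unfold(k)(x)            # (1, in_ch*k*k, n_patches) = (1, 147, 484)
    w = conv.weight.view(out_ch, -1)     # (32, 147): one linear map applied to every patch
    out = (w @ patches).view(1, out_ch, 22, 22)

    print(torch.allclose(out, conv(x), atol=1e-4))  # True, up to float error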

Now we ask for out-channels < in-channels * w * h if we want the output to be a somehow compressed representation (an extreme case for linear is word vectors or other dense representation of categorical features). However, in the up part of a Unet, I would think that we are not really compressing information any more.

You could probably reason about what dimension the space of “reasonable images” has (much lower than pixel-count) and that you’d need to come from a representation that actually resolves this.
It’d be very interesting to see if one can find a pattern of how much is a good ratio and where.

At some point, things like stride vs. correlation of adjacent outputs and pooling probably also play a role. :slight_smile:

Best regards

Thomas

7 Likes

Wow that’s fantastic!

Why not?

Thanks!

I would think you would not have compression in the upsampling part of a Unet because the dimension of "reasonable" outputs is typically lower than the number of pixels. For segmentation maps that certainly seems to be the case. For natural images my mental model is wavelet compression, which - albeit lossy - seems to hint at a limited output space. As such, if we have an efficient representation of what's going on, my intuition would be that we have some sort of decompression because the output isn't as efficient.