No, there's no need - since the numerator and denominator are always being debiased in the same way, they always cancel out.
Fixed - thanks.
That's intentional - it can be compared to the What can we do in a single epoch? section.
Pretend to be in a multi-category classification: if you feed a MultiCategory a one-element array in your label, you'll probably be fine.
Those cyclic spikes in the activation plots…
I think they are the validation set activations that are getting appended to the training activations. data.train_dl has 98 minibatches. The spikes occur every 108 minibatches.
Was this already obvious to everyone but me?
Nope - not obvious enough to stop me making the mistake in the code of forgetting to remove validation from the pic!
Mystery solved then. For a while I was looking for some weird chaotic dynamic.
And that's exactly what __constants__ does (tells the JIT that those are fixed numbers rather than varying things). I guess that's part of:
Best regards
Thomas
If for a single epoch we first go through the training data (98 batches) and then the validation data (20 batches), then all we need to do is plot only the first 98 statistics - as you say, remove validation from the pic.
To ensure we plot data only for the convolutional layers we can make this number a variable.
As shown in this notebook.
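A minimal sketch of the slicing described above (the variable names here are my own, not the notebook's): for a single epoch the first 98 entries of each layer's statistics are the training batches, and everything after is validation.

```python
n_train = 98   # len(data.train_dl) from the thread
n_valid = 20   # validation batches appended after the training ones

def train_only(layer_stats, n_train=n_train):
    # For a single epoch, the first n_train entries are the training batches;
    # dropping the rest removes the validation spikes from the plot.
    return layer_stats[:n_train]

stats = list(range(n_train + n_valid))   # fake per-batch means for one layer
print(len(train_only(stats)))            # 98
```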
Can someone explain the purpose of `_order=0` in `class Callback():` in more detail?
`_order` represents the order in which callbacks are executed: in the Runner class, the smaller the `_order`, the earlier it runs.
def __call__(self, cb_name):
    res = False
    for cb in sorted(self.cbs, key=lambda x: x._order): res = cb(cb_name) and res
    return res
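A runnable toy version of the sorting above, to make the behaviour concrete. The subclasses Recorder and LRFinder here are hypothetical stand-ins of my own, not the course's actual callbacks:

```python
# Minimal sketch of how `_order` drives callback execution order.
class Callback():
    _order = 0
    def __call__(self, cb_name):
        print(f'{type(self).__name__}.{cb_name}')
        return False

class Recorder(Callback):
    _order = 1   # runs after any callback with _order=0

class LRFinder(Callback):
    _order = 0

cbs = [Recorder(), LRFinder()]
for cb in sorted(cbs, key=lambda x: x._order):
    cb('begin_fit')
# LRFinder.begin_fit prints before Recorder.begin_fit,
# even though Recorder comes first in the list
```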
Around 1:11:00, Jeremy talks about using larger filter sizes (e.g. 5*5 or 7*7) for the first convolution layer. So for each convolution we have e.g. 7*7*3 = 147 inputs (3 channels) and 32 outputs (= # of channels).
If we used a 3*3 filter we would have 27 inputs and 32 outputs, which Jeremy says is 'losing information'. Why would that be losing information when we have more outputs than inputs?
Is the general rule of thumb to have inputs be a multiple of outputs for conv layers?
So in the second conv layer, we have 3*3*32 = 288 inputs and 64 outputs (assuming 64 channels), which means 3*3 is suitable if we double the # of channels. Maybe a 4 to 1 ratio is about right?
Can anyone provide some good intuition as to why we want to do this? Are we trying to insert the data into a funnel so that we can squeeze out the useful info out of it?
What about when we are decreasing the # of channels (e.g. on the 'up' part of a Unet)? If we have a convolution with 512 input channels and 256 output channels and a 3*3 convolution, then we have 3*3*512 = 4608 inputs and 256 outputs. So even more squeezing, with a roughly 20 to 1 ratio.
I find that it's an interesting way to analyze model architectures, and I wonder if it could provide a way to improve them!
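The ratios above are easy to tabulate. A small sketch (the helper name is my own) that reproduces the numbers from this discussion:

```python
# Compute the "receptive-field inputs vs. output channels" ratio
# discussed above for a few conv configurations.
def conv_ratio(in_ch, out_ch, k):
    inputs = in_ch * k * k      # values each output activation sees
    return inputs / out_ch

print(conv_ratio(3, 32, 7))     # first layer, 7x7: 147/32 ≈ 4.6
print(conv_ratio(3, 32, 3))     # first layer, 3x3: 27/32 ≈ 0.84 (more outputs than inputs)
print(conv_ratio(32, 64, 3))    # second layer: 288/64 = 4.5
print(conv_ratio(512, 256, 3))  # Unet "up" path: 4608/256 = 18
```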
I am having a similar question, but perhaps a bit more generic. What is a good way of thinking about dimensions of layers in a network architecture?
You're using the term inputs in your 'measurements', but they aren't inputs - they're input activations (video 1:09:00).
I don't have enough experience to try to explain this on my own, but after re-listening to the few minutes of video around 1:09-1:11, my understanding is that when Jeremy says 'losing information' he really means 'wasting information'. If you go from 9->8 with a linear transformation, that layer becomes almost useless, since it took almost 8 bits (9) and turned them back into 8 bits; when you go from 27->32 it's wasteful since you took 27 bits and turned them into 32 bits (extrapolation).
So I think Jeremy is looking for some sort of sufficient compression, making the optimizer for the first conv layer work hard to pick kernel weights so that it'll gather important info and discard unimportant info.
So he suggests imagenet does 147->32 and the class nb does 25->8 - so a compression of about 3-5 times. (Also note that this is over 2 to 3 dimensions, so the x3-5 is over a flattened size.)
I don't know if my intuition is a good one. But I think his main point was to keep the network design lean and mean.
On the other hand, those conv layers are immediately followed by a ReLU, in which case it's no longer a wasteful 1:1 re-mapping of bits, since we have non-linearity there. And this point is not clear to me. Without the ReLU, what I wrote above makes sense, since then we just reshuffle bits, but with the ReLU it's no longer a linear re-mapping or a wasteful extrapolation.
Yes exactly - later in the video there was a question asked about this, and in my answer I said that 'losing information' was a poor way for me to express it.
Even with the relu, I don't think that having more channels out than receptive field in makes sense. But I'm not 100% sure. It needs some experiments and more thinking!
Great question - here's a paper where they tried to answer that question:
Thanks @stas. I think it's an interesting subject to discuss. It makes sense to say that if we have an extrapolation we are being wasteful. Not sure if it's a waste of data; maybe more a waste of computation and a waste of a layer?
Regarding ReLU: since it applies to each individual output activation, I believe it wouldn't change the mapping between input and output activations, but instead just zero out the negative outputs.
Given that a compression of 4-5 is a bit of a standard, I find it interesting to look at different types of layers and see what kind of compression they imply:
The up part of Unet uses 20x compression - is it better/worse than trying to keep it at 4x? Would need some experimenting…
Transposed convs:
http://deeplearning.net/software/theano/_images/padding_strides_odd_transposed.gif
Looking at the animated gif, we end up with compressions varying between 1 and 4, with an average around 1.8 (if I counted right).
If instead our upsampling consists of a 3*3 conv with 4 times more channels, followed by a PixelShuffle(2), we have a 3*3/4 = 2.25 compression (part-2-lesson-14-2018).
Maybe it's worth dividing the output # of channels by 2 to get a bit more compression out of the upsampling layer, and fewer parameters in the process…
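A quick sketch of the upsampling block described above (channel count 64 is an arbitrary choice of mine), confirming both the shapes and the 2.25 ratio:

```python
# A 3x3 conv that quadruples the channels, followed by PixelShuffle(2),
# as in the upsampling scheme discussed above.
import torch

in_ch = 64
block = torch.nn.Sequential(
    torch.nn.Conv2d(in_ch, 4 * in_ch, kernel_size=3, padding=1),
    torch.nn.PixelShuffle(2),   # (N, 4*C, H, W) -> (N, C, 2H, 2W)
)
x = torch.randn(1, in_ch, 16, 16)
print(block(x).shape)           # torch.Size([1, 64, 32, 32])

# Per output position the conv sees 3*3*in_ch values and emits 4*in_ch:
print(9 * in_ch / (4 * in_ch))  # 2.25
```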
Yes, your definition is much more precise, @Seb, than the 'waste of data' that I said. Thank you!
My feeling is that we are in the early days, where we still need to tell DL boxes what to do - i.e. we still use hand-designed algorithms - and it probably won't be too long before architectures are self-designing, so such questions will be a thing of the past. But we aren't quite there yet, so we do need to understand such situations well so that we can do better.
Regarding ReLU: since it applies to each individual output activation, I believe it wouldn't change the mapping between input and output activations, but instead just zero out the negative outputs.
But that's the whole point of ReLU - it changes the mapping between input and output; don't forget the gradients! If it's somewhat linear still after the forward pass, it's very non-linear after the backward pass.
One way to look at it might be that having more outputs than inputs means diluting information.
If you think about this not for images and convolutions but in terms of linear layers, I think it is easier to understand why it isnât that useful to have more outputs than inputs here: A linear layer of, say, 10d inputs and 100d outputs will map the 10d input space into a 10d subspace of the 100d output space. As such, you have an inefficient representation of a 10d space. (Of course, if you have a ReLU nonlinearity, you have +/- so you might reasonably expect to fill 20d with positive numbers, but 100d still is wasteful.)
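The rank argument above can be checked numerically. A sketch (random weights stand in for a trained layer): mapping 10d inputs through a 10->100 linear layer yields outputs that span only a 10-dimensional subspace.

```python
# A linear map from 10d to 100d only fills a 10-dimensional subspace of
# the output space, so the extra 90 dimensions carry no new information.
import torch

torch.manual_seed(0)
W = torch.randn(100, 10)            # weights of a 10 -> 100 linear layer (no bias)
x = torch.randn(1000, 10)           # 1000 random 10d inputs
y = x @ W.t()                       # 1000 outputs living in 100d
print(torch.linalg.matrix_rank(y))  # rank is (almost surely) 10, not 100
```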
But convolutions are essentially linear maps between patches (e.g. look at the torch.nn.Unfold example), i.e. you have (in-channels * w * h) inputs and out-channels outputs, so the dimension counting works as you (and Jeremy) propose. Now we ask for out-channels < in-channels * w * h if we want the output to be a somehow compressed representation (an extreme case for linear is word vectors or other dense representations of categorical features). However, in the up part of a Unet, I would think that we are not really compressing information any more.
You could probably reason about what dimension the space of 'reasonable images' has (much lower than pixel-count) and that you'd need to come from a representation that actually resolves this.
It'd be very interesting to see if one can find a pattern of how much is a good ratio and where.
At some point, things like stride vs. correlation of adjacent outputs and pooling probably also play a role.
Best regards
Thomas
Thanks!
I would think you would not have compression in the upsampling part of UNet because the dimension of 'reasonable' outputs is typically lower than the number of pixels. For segmentation maps that certainly seems to be the case. For natural images my mental model is wavelet compression, which - albeit lossy - seems to hint at a limited output space. As such, if we have an efficient representation of what's going on, my intuition would be that we have some sort of decompression because the output isn't as efficient.