I think they are the validation set activations that are getting appended to the training activations. data.train_dl has length 98 minibatches. The spikes occur every 108 minibatches.
If for a single epoch we first go through the training data (98 batches) and then the validation data (20 batches), then all we need to do is plot only the first 98 statistics of each epoch - i.e. remove the validation data from the plot, as you say.
To make sure we plot data only for the convolutional layers, we can make this number a variable.
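A minimal sketch of that idea (the variable names here are my assumptions, not the notebook's actual ones): if the recorded per-minibatch statistics interleave 98 training batches and 20 validation batches per epoch, we can slice out just the training part before plotting.

```python
# Assumed counts from the discussion above: 98 train + 20 valid batches/epoch.
n_train, n_valid = 98, 20
n_total = n_train + n_valid

def train_only(stats, n_train=n_train, n_total=n_total):
    """Keep only the first n_train entries out of every n_total (one epoch)."""
    return [s for i, s in enumerate(stats) if i % n_total < n_train]

# Stand-in for per-minibatch means recorded by a hook over 2 epochs:
means = list(range(2 * n_total))
print(len(train_only(means)))  # 196 = 2 * 98
```

With `n_train` as a variable, the same slicing works for any train/valid split.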
Around 1:11:00, Jeremy talks about using larger filter sizes (e.g. 5×5 or 7×7) for the first convolution layer. So for each convolution we have e.g. 7×7×3 = 147 inputs (3 channels) and 32 outputs (= # of channels).
If we used a 3×3 filter we would have 27 inputs and 32 outputs, which Jeremy says is “losing information”. Why would that be losing information when we have more outputs than inputs?
Is the general rule of thumb to have inputs be a multiple of outputs for conv layers?
So in the second conv layer we have 3×3×32 = 288 inputs and 64 outputs (assuming 64 channels), which means 3×3 works well if we double the # of channels. Maybe a 4 to 1 ratio is about right?
Can anyone provide some good intuition as to why we want to do this? Are we trying to insert the data into a funnel so that we can squeeze out the useful info out of it?
What about when we are decreasing the # of channels (e.g. on the “up” part of a Unet)? If we have a 3×3 convolution with 512 input channels and 256 output channels, then we have 3×3×512 = 4608 inputs and 256 outputs. So even more squeezing, with an 18 to 1 ratio.
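The ratios above all follow one formula, so here is a small helper to compute them (my own sketch, not from the lesson code): the number of receptive-field input activations per output activation of a k×k convolution.

```python
def conv_compression(k, c_in, c_out):
    """Receptive-field inputs per output activation of a k x k convolution."""
    return (k * k * c_in) / c_out

print(conv_compression(7, 3, 32))     # 147/32 ≈ 4.59 (ImageNet-style first layer)
print(conv_compression(3, 3, 32))     # 27/32 ≈ 0.84  (the "wasteful" expansion)
print(conv_compression(3, 32, 64))    # 288/64 = 4.5  (second conv layer above)
print(conv_compression(3, 512, 256))  # 4608/256 = 18 (Unet "up" path)
```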
I find that it’s an interesting way to analyze model architectures, and I wonder if it could provide a way to improve them!
You’re using the term inputs in your “measurements”, but they aren’t inputs - they are input activations (video 1:09:00).
I don’t have enough experience to try to explain this on my own, but after re-listening to the few minutes of video around 1:09-1:11, my understanding is that when Jeremy says “losing information” he really means “wasting information”. If you go from 9->8 with a linear transformation, that layer becomes almost useless, since it took 9 bits and turned them back into almost the same 8 bits; and if you go from 27->32, it’s wasteful since you took 27 bits and turned them into 32 bits (extrapolation).
So I think Jeremy is looking for some sort of sufficient compression and making the optimizer for the first conv layer work hard to pick kernel weights so that it’ll gather important info and discard unimportant info.
So he suggests that imagenet does 147->32 and the class nb does 25->8 - a compression of about 3-5 times. (Also note that this is over 2 to 3 dimensions, so the x3-5 ratio is over a flattened size.)
I don’t know if my intuition is a good one. But I think his main point was to keep the network design to be lean and mean.
On the other hand, those conv layers are immediately followed by a ReLU, in which case it’s no longer a wasteful 1:1 re-mapping of bits, since we have a non-linearity there. This point is not clear to me: w/o ReLU what I wrote above makes sense, since then we just reshuffle bits, but with ReLU it’s no longer a linear re-mapping or a wasteful extrapolation.
Yes exactly - later in the video there was a question asked about this, and in my answer I said that “losing information” was a poor way for me to express it.
Even with the relu, I don’t think that having more channels out than receptive field in makes sense. But I’m not 100% sure. It needs some experiments and more thinking!
Thanks @stas . I think it’s an interesting subject to discuss. Makes sense to say that if we have an extrapolation we are being wasteful. Not sure if it’s a waste of data, maybe more a waste of computation and a waste of a layer?
Regarding ReLU: since it applies to each individual output activation, I believe it wouldn’t change the mapping between input and output activations, but instead just zero out the negative outputs.
Given that a compression of 4-5 is a bit of a standard, I find it interesting to look at different types of layers and see what kind of compression they imply:
The up part of a Unet uses roughly 18x compression - is it better/worse than trying to keep it at 4x? Would need some experimenting…
Looking at the animated gif, we end up with compressions varying between 1 and 4, with an average around 1.8 (if I counted right).
If instead our upsampling consists of a 3×3 conv with 4 times more channels, followed by a PixelShuffle(2), we have a 3×3/4 = 2.25 compression (part-2-lesson-14-2018).
Maybe it’s worth dividing the output # of channels by 2 to get a bit more compression out of the upsampling layer, and fewer parameters in the process…
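To check that arithmetic, here is a quick sketch (my own helper, not lesson code): a k×k conv mapping c_in channels to c_in·scale² channels, followed by PixelShuffle(scale), and an optional divisor to model halving the conv's output channels as suggested.

```python
def upsample_compression(k, c_in, scale, channel_div=1):
    """Inputs per output for a k x k conv feeding PixelShuffle(scale).

    The conv maps c_in channels to c_in * scale**2 // channel_div channels;
    channel_div=2 models halving the conv's output channels.
    """
    c_out = c_in * scale ** 2 // channel_div
    return (k * k * c_in) / c_out

print(upsample_compression(3, 256, 2))     # 2.25 (= 9/4, independent of c_in)
print(upsample_compression(3, 256, 2, 2))  # 4.5 with halved output channels
```

So halving the channels does bring the ratio back near the ~4x seen in the down path.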
Yes, your definition is much more precise, @Seb, than “waste of data”, that I said. Thank you!
My feeling is that we are in the early days, where we still need to tell DL boxes what to do - i.e. we still use hand-designed algorithms - and it probably won’t be too long before architectures become self-designing, so such questions would be a thing of the past. But we aren’t quite there yet, so we do need to understand such situations well in order to do better.
> Regarding ReLU: since it applies to each individual output activation, I believe it wouldn’t change the mapping between input and output activations, but instead just zero out the negative outputs.
But that’s the whole point of ReLU - it changes the mapping between input and output, don’t forget the gradients! If it’s somewhat linear still after the forward pass, it’s very non-linear after the backward pass.
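To make that concrete, a minimal elementwise sketch (plain Python, no framework assumed): in the forward pass ReLU only zeroes negatives, but in the backward pass it gates the gradient entirely, passing it through only where the input was positive.

```python
def relu(xs):
    """Forward pass: zero out the negative activations."""
    return [max(0.0, x) for x in xs]

def relu_grad(xs):
    """Backward pass: the upstream gradient flows only where x was positive."""
    return [1.0 if x > 0 else 0.0 for x in xs]

acts = [-1.5, 0.0, 2.0, -0.3]
print(relu(acts))       # [0.0, 0.0, 2.0, 0.0]
print(relu_grad(acts))  # [0.0, 0.0, 1.0, 0.0]
```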