While presenting this slide @jeremy mentions two Conv2d operations performed in succession:
The first one is a Conv2d that takes the outputs from the resnet model of shape `(7, 7, <num_channels>)` to a new shape of `(4, 4, 4 + <num_classes>)`.
In the lecture we are not given the other settings for this convolution, but they should be easy to figure out by looking at the notebook. My guess is a `(3, 3)` kernel with a padding of 1. These are quite common settings that preserve the feature map size at a stride of 1, and I assume that is what we go for here with a stride of 2, which gives us the `(4, 4)` feature maps.
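A quick sanity check of that guess (the 512 in-channels and 25 out-channels, i.e. 4 + 21, are placeholders I made up, not values from the notebook):

```python
import torch
import torch.nn as nn

# Guessed settings: (3, 3) kernel, stride 2, padding 1.
# 512 in-channels and 25 out-channels (4 + 21) are placeholders.
conv1 = nn.Conv2d(512, 25, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 512, 7, 7)   # the (7, 7, <num_channels>) backbone output
print(conv1(x).shape)           # torch.Size([1, 25, 4, 4])
```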
Here however, we perform another set of convolutions, going from `(4, 4)` to `(2, 2)`. Given how these two convolutions seem to be doing roughly the same thing, I would expect their parameters to be the same. But I don't see how we can go from `(4, 4)` to `(2, 2)` with a `(3, 3)` filter and a stride of 2: with a symmetric padding of 1 the kernel windows don't tile the input evenly, so one row and column of padding would go unused. We could do one-sided padding, but that sounds absolutely horrible. The only settings that seem reasonable here for the 2nd convolution would be a padding of 0 and a `(2, 2)` kernel.
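For what it's worth, PyTorch's floor-based output-size formula will happily produce `(2, 2)` from `(4, 4)` even with the first conv's settings; the kernel just never visits the bottom/right padding, which amounts to the one-sided padding mentioned above. A minimal sketch comparing both candidates (the 256 channel count is made up):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 4, 4)  # 256 channels is a placeholder

# Candidate A: same settings as the first conv (3x3, stride 2, padding 1).
# floor((4 + 2*1 - 3) / 2) + 1 = 2, but the bottom/right padding is never
# touched by any window -- effectively one-sided padding.
conv_a = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)
print(conv_a(x).shape)  # torch.Size([1, 256, 2, 2])

# Candidate B: (2, 2) kernel, stride 2, padding 0 -- tiles the input exactly.
conv_b = nn.Conv2d(256, 256, kernel_size=2, stride=2, padding=0)
print(conv_b(x).shape)  # torch.Size([1, 256, 2, 2])
```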
But is this really what is happening here? More interestingly, if these convolutions don't use the same settings, why is that?
I was really blown away by the observation that a receptive field 'looks more' at what is in the center (this is nicely shown in Excel, where more values feed into the center of a receptive field than into its edges). Could this be a factor in the conv parameters here? If we want to look as well as we can at a square, we should probably look at its center, given the nature of a receptive field, and a padding of 1 is counterproductive for that. Going from `(4, 4)` to `(2, 2)` with a `(2, 2)` kernel and no padding seems to be doing just that.
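A tiny sketch of that center-weighting effect, counting how many paths feed into each input cell after two stacked `(3, 3)` convolutions (just an illustration, not the lecture's spreadsheet):

```python
import torch
import torch.nn.functional as F

# Back-project a single output unit through two all-ones 3x3 kernels.
# Each input cell's value counts how many paths connect it to the output;
# the center of the receptive field accumulates the most.
one = torch.ones(1, 1, 1, 1)
k = torch.ones(1, 1, 3, 3)
contributions = F.conv_transpose2d(F.conv_transpose2d(one, k), k)
print(contributions.squeeze())
# tensor([[1., 2., 3., 2., 1.],
#         [2., 4., 6., 4., 2.],
#         [3., 6., 9., 6., 3.],
#         [2., 4., 6., 4., 2.],
#         [1., 2., 3., 2., 1.]])
```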
But then why would the earlier convolution use padding?
Or maybe this whole reasoning is wrong and there is something else happening here?