I noticed that the downsample method used for the ResNet networks works with stride 2 convolutions. That is fine, but what worries me is that kernel_size is set to… 1! Doesn’t that skip most of the image? My understanding is that kernel_size=1 and stride=2 samples something like the following.
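Here is a tiny sketch of my own to illustrate (plain PyTorch, not from any model code): fix the conv’s single weight to 1, and the output is just the input sampled at the even rows and columns.

```python
import torch
import torch.nn as nn

# A 1x1 stride-2 conv whose single weight is fixed to 1, so the output
# is literally the input sampled at every second position.
conv = nn.Conv2d(1, 1, kernel_size=1, stride=2, bias=False)
nn.init.ones_(conv.weight)

x = torch.arange(16.0).reshape(1, 1, 4, 4)
with torch.no_grad():
    y = conv(x)

print(x[0, 0])
# tensor([[ 0.,  1.,  2.,  3.],
#         [ 4.,  5.,  6.,  7.],
#         [ 8.,  9., 10., 11.],
#         [12., 13., 14., 15.]])
print(y[0, 0])
# tensor([[ 0.,  2.],
#         [ 8., 10.]])
```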
That stride 2 skips every second pixel would not shock me, because dimension reduction with, for example, max or average pooling also discards a lot of information (which is, in the end, what you do when you reduce the dimensions).
1x1 convolutions don’t bother me at all. Stride 2 convolutions don’t bother me. What bothers me is the combination of the two. In fact, whenever you have kernel_size < stride, you’ll have some pixels that can’t contribute anything to the final answer (unless my understanding is incorrect).
Avg pooling doesn’t discard any pixel outright, and max pooling discards the lower values, but that’s fine by me, as every pixel has an equal opportunity to be the maximum. What I don’t understand is simply discarding 3/4 of the pixels: they have no chance of ever being taken into account. Why compute them at all?
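A quick sanity check of that 3/4 figure (my own sketch, not from torchvision): see which input positions can influence the output at all by checking which of them receive a nonzero gradient.

```python
import torch
import torch.nn as nn

# With a kernel_size=1, stride=2 conv, only the even (h, w) positions are
# ever read. (With random weights, a used pixel getting an exactly-zero
# gradient is vanishingly unlikely, so nonzero gradient ~ "contributes".)
conv = nn.Conv2d(3, 8, kernel_size=1, stride=2, bias=False)

x = torch.randn(1, 3, 8, 8, requires_grad=True)
conv(x).sum().backward()

used = x.grad.abs().sum(dim=1)[0] != 0  # (h, w) positions that contributed
print(used.float().mean().item())       # 0.25 -> 3/4 of the pixels are unused
```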
I see that this comes from PyTorch: torchvision.models.resnet* does the same thing, so I’ll ask there.
Bit of an old thread. A kernel_size=1, stride=2 conv is the standard downsampling in the ResNet shortcut (projection) path. Yes, it throws away data, but so does any basic sub-sampling image downscaling algorithm.
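For reference, this is roughly what torchvision builds for that shortcut when a block changes resolution or channel count (a paraphrase of the source, with my own function name):

```python
import torch.nn as nn

def downsample(in_planes: int, out_planes: int, stride: int = 2) -> nn.Sequential:
    # The 1x1 conv does the spatial downsampling (stride 2) and the channel
    # matching in one step; with kernel_size < stride, 3 out of 4 input
    # positions are never read.
    return nn.Sequential(
        nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False),
        nn.BatchNorm2d(out_planes),
    )
```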
There are improvements on that aspect of ResNet. SENet used kernel_size=3 for the downsample. The ‘Bag of Tricks’ ResNet tweaks (ResNet-D) add an AvgPool before the 1x1: the AvgPool does the spatial downsampling and the 1x1 conv does the channel matching.
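A sketch of that ResNet-D shortcut under the same assumptions as above (the function name is mine):

```python
import torch.nn as nn

def downsample_d(in_planes: int, out_planes: int, stride: int = 2) -> nn.Sequential:
    # AvgPool handles the spatial downsampling, so every pixel contributes;
    # the 1x1 conv (now stride 1) only has to match the channel count.
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=stride, stride=stride),
        nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=1, bias=False),
        nn.BatchNorm2d(out_planes),
    )
```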
The fastai xresnet uses avg pooling in the shortcut.