Hi All –
I was looking at the implementation of ResNet50 in Keras, and was wondering whether it might be possible to speed the model up, as there appears to be some unnecessary computation going on.
Because the 1x1 convolutions in the conv_block blocks use strides = (2,2), they downsample by just ignoring every other spatial position in each dimension, so it seems like we could just not compute the ignored positions in the first place. Anyone have any thoughts on this?
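To make the point concrete, here is a minimal NumPy sketch (not the Keras code itself). The helper name conv1x1_strided and the toy shapes are made up for illustration. It shows that a 1x1 convolution with stride 2 gives the same result as subsampling first, and that the skipped positions have no effect on the output.

```python
import numpy as np

def conv1x1_strided(x, w, stride):
    """Naive 1x1 convolution: x is (H, W, C_in), w is (C_in, C_out)."""
    h, wd, _ = x.shape
    out_h = (h - 1) // stride + 1
    out_w = (wd - 1) // stride + 1
    out = np.empty((out_h, out_w, w.shape[1]))
    for i in range(out_h):
        for j in range(out_w):
            # Only the input at (i*stride, j*stride) is ever read.
            out[i, j] = x[i * stride, j * stride] @ w
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))   # toy feature map (assumed shape)
w = rng.normal(size=(4, 6))      # toy 1x1 kernel (assumed shape)

y = conv1x1_strided(x, w, stride=2)

# Same result from subsampling first, then a stride-1 1x1 convolution.
y_sub = x[::2, ::2] @ w

# Zero out every position the strided convolution skips: output is unchanged.
skipped = np.ones((8, 8), dtype=bool)
skipped[::2, ::2] = False
x_perturbed = x.copy()
x_perturbed[skipped] = 0.0
y_perturbed = conv1x1_strided(x_perturbed, w, stride=2)

print(y.shape, np.allclose(y, y_sub), np.allclose(y, y_perturbed))
```

With stride 2 in both dimensions, only one in four input positions is read, so whatever layer produced the other three quarters did work that this convolution never uses.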
A broader question is why the ResNet authors did downsampling this way. It seems like they’re potentially throwing away information that would be captured with avg- or max-pooling instead of the 2-strided 1x1 convolutions.
This paper (which I haven’t read) argues for replacing max-pooling with pure convolutions that use larger strides, at least in part on the basis of simplicity: you only need one kind of layer (convolution), not two (convolution + max-pooling).
I imagine another reason is the one you point out (though in the other direction): I think max-pooling is avoided more now than in the past because it is thought to lose more information than increased convolution stride lengths. I don’t have a rigorous argument, but my intuition is that the convolution operation more familiar to me, as used in for example physics and traditional image processing, is reversible, whereas discarding values is not (though I think the ‘convolution’ in deep learning isn’t quite the same operation).
For example, an image blurred using convolution can in principle be un-blurred if you know the kernel function, provided the kernel doesn’t completely wipe out any frequency and there isn’t too much noise (which has caught out some people who thought they’d hidden information that way!). Of course with a non-unity stride length you’re still losing information, but my guess is that there is still reversibility in some restricted sense that takes account of the downsampling, and that this is not the case for max-pooling.
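Here is a small NumPy sketch of that intuition, under some simplifying assumptions: circular (wrap-around) convolution done via the FFT, no noise, and a toy blur kernel chosen so that its Fourier transform has no zeros. The helper max_pool_2x2 is also just illustrative. It is not how deep learning frameworks implement either operation.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((16, 16))

# Toy blur kernel (assumed): centre weight plus four neighbours, with wrap-around.
# Its Fourier transform is 0.6 + 0.2*cos(a) + 0.2*cos(b) >= 0.2, so never zero.
kernel = np.zeros((16, 16))
kernel[0, 0] = 0.6
kernel[0, 1] = kernel[1, 0] = 0.1
kernel[0, -1] = kernel[-1, 0] = 0.1

# Blur: circular convolution is a pointwise product in Fourier space.
blurred = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel)))

# Un-blur: divide by the kernel's transform (only valid because it has no zeros).
recovered = np.real(np.fft.ifft2(np.fft.fft2(blurred) / np.fft.fft2(kernel)))
print(np.allclose(recovered, image))

def max_pool_2x2(x):
    """Non-overlapping 2x2 max-pooling on a 2D array with even side lengths."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two different inputs that max-pooling maps to the same output,
# so the input cannot be recovered from the pooled result alone.
a = np.array([[1.0, 0.0], [0.0, 0.0]])
b = np.array([[0.0, 0.0], [0.0, 1.0]])
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))
```

The inversion only works here because the kernel is known and well conditioned; with noise or a kernel that zeroes out some frequencies, plain division like this breaks down.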
Actually, the abstract for that paper mentions deconvolution (‘unblurring’), which I know is also a concept that has been put to work in deep learning models. I haven’t read about that either, but I wouldn’t be surprised if max-pooling gets in the way of it because of irreversibility.
As for whether there’s a more efficient implementation, why not try it and see?
Stop press: there are some good arguments here: https://www.reddit.com/r/MachineLearning/comments/5x4jbt/d_strided_convolutions_vs_pooling_layers_pros_and/