Hi @jeremy, I am not very clear about how CNNs account for (or fail to account for) variations in the above properties. Here is my understanding of each. Please let me know whether it is accurate.
Position of the subject in the picture: the model should be able to detect the subject regardless of its position in the picture, since the last dense or pooling layer will “summarize” the features in all sub-areas of the picture and come to a holistic conclusion. Therefore, even if the training data only contains pictures of cats in the top left corner, the model should be able to generalize with respect to position and successfully classify test data with cats in the bottom right corner.
Size and orientation: I do not think these two properties share the same benefit as position, described above. If the training set only contains the subject in an upright position and at the same size relative to the picture, then the model will fail on test images where the subject is upside down or significantly larger or smaller than in the training set.
Image dimension: if an image is not square, how do we make it square before feeding it to the model? Do we stretch it or add black areas to fill it up? My guess is the former. However, if an image is stretched, doesn’t it cause the performance of the model to deteriorate (unless test images are stretched in the same way)?
The dense layer is very sensitive to position in the picture. Each weight in the dense layer is specific to one convolutional filter activation in one location. So it’s not the dense layer that’s well-suited to handling this. Instead, it’s the convolutional layers and the max-pooling. So by the last max-pooling layer of VGG16, which outputs a 7x7 grid of filter activations, the previous layers could have found a cat in any one of those locations.
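As a rough illustration of the difference, here is a minimal 1D sketch in plain Python. It is a toy, not VGG16: the signal, kernel, and `dense_weights` are made up for the example, and the global max-pool is a simplification (VGG16's pools are 2x2, so they only give local tolerance to shifts). It shows that a convolution followed by max-pooling responds the same way to a pattern wherever it sits, while a dense unit with position-specific weights does not.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal (valid cross-correlation)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def global_max_pool(feature_map):
    """Keep only the strongest response, discarding where it occurred."""
    return max(feature_map)


def dense_unit(feature_map, weights):
    """One dense output: each position has its own dedicated weight."""
    return sum(a * w for a, w in zip(feature_map, weights))


pattern = [1, 2, 1]              # hypothetical "cat" pattern
kernel = [1, 2, 1]               # filter tuned to that pattern
left = pattern + [0] * 5         # pattern at the start of the signal
right = [0] * 5 + pattern        # same pattern at the end

fm_left = conv1d(left, kernel)   # [6, 4, 1, 0, 0, 0]
fm_right = conv1d(right, kernel) # [0, 0, 0, 1, 4, 6]

# Hypothetical dense weights that only "look at" the first three positions.
dense_weights = [1, 1, 1, 0, 0, 0]

print(global_max_pool(fm_left), global_max_pool(fm_right))  # 6 6
print(dense_unit(fm_left, dense_weights),
      dense_unit(fm_right, dense_weights))                  # 11 0
```

The feature map shifts along with the pattern, and the pooling step throws away the location. The dense unit sees a completely different input at each of its weights once the pattern moves.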
I have been pondering this answer and would like to make sure I understand dense and convolutional layers well. My understanding is:
Dense layer = Linear function applied to each pixel. So it does not handle any context of the pixel.
Convolution layer = A filter applied to blocks of pixels. So there is some handling of context here - that is - neighboring pixels.
Is my understanding so far accurate?
Not sure if your dense layer definition is correct or not - I think perhaps not. It’s a linear function applied to all pixels - every pixel and every filter has its own weight in the model.
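To make "every pixel and every filter has its own weight" concrete, here is a small sketch of a single dense output unit in plain Python. The 2x2x3 activation shape, the random values, and the names are all assumptions for illustration only.

```python
import random

HEIGHT, WIDTH, FILTERS = 2, 2, 3   # tiny hypothetical activation map

random.seed(0)


def random_tensor():
    """Build a HEIGHT x WIDTH x FILTERS nested list of random floats."""
    return [[[random.random() for _ in range(FILTERS)]
             for _ in range(WIDTH)]
            for _ in range(HEIGHT)]


activations = random_tensor()      # output of the last conv/pool layer
weights = random_tensor()          # one weight per (row, column, filter)
bias = 0.1


def dense_unit(activations, weights, bias):
    """One dense output: a weighted sum over ALL positions and filters."""
    total = bias
    for y in range(HEIGHT):
        for x in range(WIDTH):
            for f in range(FILTERS):
                total += activations[y][x][f] * weights[y][x][f]
    return total


num_weights = HEIGHT * WIDTH * FILTERS
print(num_weights)                             # 12 weights for this one unit
print(dense_unit(activations, weights, bias))  # a single number
```

Every output unit of the dense layer gets its own full set of weights like this, which is why the weight for "filter 2 at the top left" is unrelated to the weight for "filter 2 at the bottom right".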
Thanks Jeremy. I think I understand size and distortion now, but I still can’t wrap my head around how the model can be position invariant. If the last max-pooling layer produces a 7x7 output, doesn’t that mean each piece only covers 1/49th of the image? Unless an object (e.g. a cat) fits perfectly in one of those 49 regions, how is the model able to see it?
This is a most interesting question. The dense layer can use weights that combine the different regions in appropriate ways for a cat - e.g. the head should be above the body, the body above the legs, etc. There are enough parameters (about 100 million in the first dense layer!) that it can implement many different configurations to look for in terms of scale, rotation, etc., especially given that the underlying convolutional filters are themselves very sophisticated by that point (since they consist of many layers of computation themselves).
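The "about 100 million" figure can be checked with a little arithmetic. The sketch below assumes the standard VGG16 shapes: a 7x7x512 output from the last max-pooling layer feeding a dense layer of 4096 units. The 3x3 convolution at the end is included only for comparison.

```python
# Assumed standard VGG16 shapes (not stated explicitly in this thread).
POOL_HEIGHT, POOL_WIDTH, POOL_FILTERS = 7, 7, 512
DENSE_UNITS = 4096

flattened_inputs = POOL_HEIGHT * POOL_WIDTH * POOL_FILTERS
dense_weights = flattened_inputs * DENSE_UNITS
dense_params = dense_weights + DENSE_UNITS        # plus one bias per unit

# For comparison: one 3x3 convolution with 512 input and 512 output filters.
conv_params = 3 * 3 * 512 * 512 + 512

print(flattened_inputs)   # 25088
print(dense_params)       # 102764544
print(conv_params)        # 2359808
```

The dense layer needs a separate weight for every (location, filter) pair and every output unit, whereas the convolution reuses the same small kernel at every location, so its parameter count does not depend on the image size.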
I can’t fit all this in my head at all - I don’t think there’s any way that a human can visualize the configurations that hundreds of millions of parameters can create. And this makes it hard to have good intuition in designing new architectures - and indeed, really new architectures come along only very rarely.
I think the main point to be aware of is that the filters in those 49 regions are doing something far more complex than matching a simple pattern. They are doing sophisticated computation at many levels.
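One way to put a rough number on this is to compute the theoretical receptive field of a single cell in the 7x7 grid, i.e. how much of the input image can influence it. The sketch below assumes the standard VGG16 layout (five blocks of 3x3 stride-1 convolutions, each followed by a 2x2 stride-2 max-pool) and a 224x224 input. It ignores padding effects at the image border, and the region that matters most in practice is usually smaller than this theoretical figure.

```python
# Assumed VGG16 layout: (number of 3x3 convs) per block, each block
# followed by one 2x2 max-pool with stride 2.
VGG16_LAYERS = []
for n_convs in (2, 2, 3, 3, 3):
    VGG16_LAYERS += [("conv", 3, 1)] * n_convs
    VGG16_LAYERS.append(("pool", 2, 2))


def receptive_field(layers):
    """Return (receptive field size, spacing between cells) in input pixels."""
    size, jump = 1, 1
    for _name, kernel, stride in layers:
        size += (kernel - 1) * jump   # each layer widens the view
        jump *= stride                # strides spread the cells apart
    return size, jump


rf_size, cell_spacing = receptive_field(VGG16_LAYERS)
grid = 224 // cell_spacing

print(rf_size)        # 212
print(cell_spacing)   # 32
print(grid)           # 7
```

So although the 49 cells are spaced 32 pixels apart, each one can draw on a window of roughly 212x212 input pixels, and neighbouring windows overlap heavily. A cat does not have to fit inside a 32x32 tile to be picked up.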
Sorry, I know this isn’t really a very complete answer to your question. I’m not sure there is one…
Jeremy, it would be great if you could update this as and when you gain more understanding. This is really helpful.