I'm trying to understand the thought process behind the VGG architecture, and I have the following questions.
I was told that we increase the number of filters after each max pooling step to compensate for the lost spatial resolution and preserve information. But over the last few layers of VGG, the filter count stayed at 512 even while max pooling reduced the feature maps to 14x14 and then 7x7. Why was there no need to increase the filter count there?
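To make sure I have the shapes right, here is a small sketch I wrote of how the channel count and spatial size evolve through VGG16's conv blocks (using the standard VGG16 configuration with a 224x224 input; please correct me if I've transcribed it wrong):

```python
# My sketch of VGG16's shape progression (224x224 input assumed).
# 'M' marks a 2x2 max pool with stride 2 (halves height/width); a number is
# the output channel count of a 3x3 conv with padding 1 (spatial size kept).
vgg16_cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def block_shapes(cfg, size=224):
    """Return (channels, spatial_size) after each max-pool stage."""
    shapes, channels = [], 3
    for v in cfg:
        if v == 'M':
            size //= 2
            shapes.append((channels, size))
        else:
            channels = v
    return shapes

print(block_shapes(vgg16_cfg))
# [(64, 112), (128, 56), (256, 28), (512, 14), (512, 7)]
```

So the channel count doubles at the first three pooling steps but then stays at 512 for the last two blocks, which is exactly the part I don't understand.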
Also, a few consecutive layers at the end were built with the same filter count and the same feature-map size. Were those added just to increase accuracy?
I also can't wrap my head around the idea that filters in the final layers look at a larger part of the image, e.g. identifying an eye and inferring its position relative to the nose or other features in face recognition. Or, to put it better, from convolution visualizations I understood that the final filters respond to an entire face as a feature. But doesn't max pooling leave us looking at only a subset of the image? When there are 512 filters (with a whole face as the feature) but the feature map is only 7x7, how does that work?
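To make this part of my question concrete, here is how I understand the receptive field grows through VGG16 (using the standard recursion rf += (k - 1) * jump, jump *= stride; this is my own sketch, so please correct me if the arithmetic is off):

```python
# My sketch: receptive-field growth through VGG16's conv/pool stack.
# Each entry is (kind, kernel_size, stride); 3x3 convs have stride 1,
# 2x2 max pools have stride 2.
layers = ([('conv', 3, 1)] * 2 + [('pool', 2, 2)]
          + [('conv', 3, 1)] * 2 + [('pool', 2, 2)]
          + ([('conv', 3, 1)] * 3 + [('pool', 2, 2)]) * 3)

def receptive_field(layers):
    rf, jump = 1, 1
    for _, k, s in layers:
        rf += (k - 1) * jump  # each layer widens the input window
        jump *= s             # strides compound the step between units
    return rf

print(receptive_field(layers))  # 212
```

If this is right, each unit in the final 7x7 map sees a 212x212 window of the 224x224 input, i.e. almost the whole image. Is that the mechanism by which a "face" filter can exist on a 7x7 map, or am I misreading it?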