Is maxpooling better than just using a larger filter? If so then why?

simoneva · July 12, 2017, 8:28am

Most architectures seem to follow each conv layer with maxpooling so that later layers look for larger components in the image. However, the same effect could be achieved by using a larger filter on the original input.

Is the maxpooling method more effective? This seems counterintuitive, because maxpooling destroys information whereas a larger filter does not.

pietz · July 14, 2017, 7:39am

good question.

let's finish your thought to see where it breaks apart. so instead of using maxpooling and small kernels, we don't downsample the images and use 7x7 kernels. our input is RGB with 3 channels and a resolution of 224x224. the first layer will have 128 channels, which we keep for the next 4 layers as well. at the end let's use a dense layer with 1000 outputs, just like ImageNet. cool? cool.

the network we just described has roughly 5 billion parameters, almost all of them in the final dense layer. that's roughly 40x larger than the already heavy VGG16. so that's just not going to work. we are interested in keeping the parameter count as low as possible without losing accuracy.
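to make the count concrete, here is a rough sketch of the arithmetic. it assumes stride-1 convolutions, one bias per output channel, and a dense layer attached to the flattened final feature map; the function names and the `same_padding` switch are made up for this illustration. the post does not say which padding is meant, so both cases are shown: without padding the total lands just under 5 billion, with "same" padding it is closer to 6.4 billion.

```python
def conv_params(k, c_in, c_out):
    # k*k weights per (input, output) channel pair, plus one bias per output channel
    return k * k * c_in * c_out + c_out


def count_params(size=224, in_ch=3, ch=128, n_conv=5, k=7, n_out=1000,
                 same_padding=False):
    total = 0
    c_in = in_ch
    for _ in range(n_conv):
        total += conv_params(k, c_in, ch)
        c_in = ch
        if not same_padding:
            # an unpadded stride-1 conv shrinks each spatial dimension by k-1
            size -= k - 1
    # flatten the final feature map and connect it to the dense layer
    total += size * size * ch * n_out + n_out
    return total


print(f"{count_params():,}")                   # 4,820,639,720
print(f"{count_params(same_padding=True):,}")  # 6,425,759,720
```

the five conv layers together contribute only about 3.2 million of those parameters. the rest comes from the dense layer, because without downsampling it sees a huge feature map.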

once upon a time somebody noticed that we can do that pretty well by halving the width and height of the feature maps (a factor of 4 fewer pixels) and then doubling the number of channels. by doing so we reduce the number of activations by 50% at every such step in our model. why are we doing conv + maxpool? because it works.
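the 50% figure is just shape arithmetic. a small sketch, assuming a hypothetical starting feature map of 224x224 with 64 channels (the starting size and the helper names are only for illustration):

```python
def stage_shapes(height, width, channels, n_stages):
    # each stage halves width and height and doubles the channel count
    shapes = [(height, width, channels)]
    for _ in range(n_stages):
        height, width, channels = height // 2, width // 2, channels * 2
        shapes.append((height, width, channels))
    return shapes


def volume(shape):
    # total number of activations in a feature map
    h, w, c = shape
    return h * w * c


shapes = stage_shapes(224, 224, 64, n_stages=3)
for before, after in zip(shapes, shapes[1:]):
    print(before, "->", after, "ratio:", volume(after) / volume(before))
```

every stage prints a ratio of 0.5: a quarter of the pixels times twice the channels.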

stacking two 3x3 kernels gives a view of 5x5 pixels but uses only 18 instead of 25 weights (per input/output channel pair). three stacked 3x3 kernels give a 7x7 view with only 27 instead of 49 weights. if we now put a maxpool in between, the receptive field grows even further. two 3x3 kernels will usually also give better results than one 5x5, since there is an extra nonlinearity in between, so that's good as well.
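these numbers can be checked with the usual receptive field recurrence: each layer adds `(kernel - 1) * jump` pixels, where `jump` is the product of the strides of all earlier layers. a minimal sketch, with layers given as hypothetical `(kernel, stride)` pairs and weights counted per input/output channel pair, ignoring biases:

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs, applied in order
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


def stacked_weights(n_layers, kernel):
    # weights per (input, output) channel pair, biases ignored
    return n_layers * kernel * kernel


conv3 = (3, 1)  # 3x3 conv, stride 1
pool2 = (2, 2)  # 2x2 maxpool, stride 2

print(receptive_field([conv3, conv3]), stacked_weights(2, 3))         # 5 18
print(receptive_field([conv3, conv3, conv3]), stacked_weights(3, 3))  # 7 27
print(receptive_field([conv3, pool2, conv3]))                         # 8
```

so two 3x3 convs with a 2x2 maxpool between them see 8x8 input pixels instead of 5x5, without adding a single weight.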

simoneva · July 14, 2017, 11:00am

Thanks for the answer. There seems to be a trend towards deeper networks with more parameters, so I guess the main driver is compute cost rather than parameter count: maxpool plus more channels is cheaper than bigger filters.

I note that the Inception architecture does use multiple filter sizes, but even that architecture also uses maxpooling.