let's finish your thought to see where it breaks apart. so instead of using maxpooling and small kernels, we don't downsample the images and use 7x7 kernels. our input is rgb with 3 channels and a resolution of 224x224. the first layer will have 128 channels, which we keep using for the next 4 layers as well. at the end let's use a dense layer with 1000 outputs, just like imagenet. cool? cool.
the network we just described has roughly 6.4 billion parameters, about 45x more than the already heavy VGG16 (138 million). so that's just not going to work. we want to keep the parameter count as low as possible without losing accuracy.
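counting it out in python makes it obvious where the blow-up comes from. a rough sketch, assuming 'same' padding (so the feature maps stay at 224x224) and biases included:

```python
# parameter count for the hypothetical no-downsampling network described above
def conv_params(kernel, c_in, c_out):
    # weights (kernel x kernel x c_in per output channel) plus one bias per output channel
    return kernel * kernel * c_in * c_out + c_out

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

total = (
    conv_params(7, 3, 128)                  # first 7x7 conv: rgb -> 128 channels
    + 4 * conv_params(7, 128, 128)          # four more 7x7 convs at 128 channels
    + dense_params(224 * 224 * 128, 1000)   # dense layer on the full-resolution feature map
)
print(total)  # ~6.4 billion, dominated almost entirely by the dense layer
```

the five conv layers together are only ~3 million parameters; the dense layer sitting on a full 224x224x128 feature map is what kills us.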
once upon a time somebody noticed that we can do that pretty well by reducing the spatial resolution of an image by a factor of 4 (width & height each halved, so a quarter of the pixels) and then doubling the number of channels. by doing so we reduce the amount of information by 50% for every step in our model. why are we doing conv + maxpool? because it works.
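a quick sketch of that bookkeeping, assuming a 2x2 maxpool (halving width and height) while the conv doubles the channels:

```python
# number of activation values per stage when we halve width & height
# (a quarter of the pixels) and double the channels
def values(h, w, c):
    return h * w * c

before = values(224, 224, 128)  # one stage of the model
after = values(112, 112, 256)   # after one conv (double channels) + maxpool step
print(after / before)  # 0.5 -> each step keeps exactly half the information
```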
using 2 3x3 kernels behind each other we get a view of 5x5 pixels, but only use 18 instead of 25 parameters. using 3 3x3 kernels it's a 7x7 view, but we only use 27 instead of 49 parameters. if we now drop a maxpool in between, the receptive field grows even further. 2 3x3 kernels will also give better results than 1 5x5, so that's good as well.
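the numbers above can be checked with the standard receptive-field formula: each layer adds (kernel - 1) times the product of all previous strides. a small sketch:

```python
def receptive_field(layers):
    # layers is a list of (kernel_size, stride) pairs, applied in order
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # extra input pixels this layer sees
        jump *= stride             # distance between neighbouring output pixels
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5 -> two 3x3 convs see 5x5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7 -> three 3x3 convs see 7x7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8 -> a 2x2 maxpool in between grows it further

# parameter comparison per filter (ignoring channels):
print(2 * 3 * 3, "vs", 5 * 5)  # 18 vs 25
print(3 * 3 * 3, "vs", 7 * 7)  # 27 vs 49
```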