This is a question which I’m not sure has an answer.

It is common practice for ConvNets to use a larger kernel (e.g. 5x5) in the first layer, then 3x3 kernels for the many layers that follow. The thing I don’t understand is: why?

Two stacked 3x3 conv layers have a receptive field of 5x5, need fewer arithmetic operations than a single 5x5 layer, and contain more non-linearities. So they should be faster and able to represent more complex functions. (We’ll ignore concatenated 1x3 and 3x1 layers, though the same logic should apply.)
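To make the receptive-field and weight-count comparison concrete, here is a minimal sketch. It assumes stride-1, undilated, single-channel, bias-free layers; the function names are hypothetical and only for illustration.

```python
def receptive_field(kernel_sizes):
    """Receptive field (one side) of stacked stride-1, undilated convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each k x k layer widens the field by k - 1
    return field


def weight_count(kernel_sizes):
    """Number of kernel weights, assuming one input and one output channel, no bias."""
    return sum(k * k for k in kernel_sizes)


print(receptive_field([3, 3]), weight_count([3, 3]))  # 5 18
print(receptive_field([5]), weight_count([5]))        # 5 25
```

Both stacks see the same 5x5 window, but the two 3x3 layers do it with 18 weights instead of 25.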

Take a 10x10 input, 100 pixels, padded so that every pixel gets an output. If we convolve a 5x5 over each of the 100 pixels, we have 2500 multiplies and 2400 adds. If we convolve a 3x3 over each of the 100 pixels, we have 900 multiplies and 800 adds. Do that again for the second 3x3 layer, and the total is 1800 multiplies and 1600 adds.
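The counts above can be reproduced with a few lines. This is a rough sketch that assumes a single channel, no bias, and one output per input pixel (i.e. "same" padding); `conv_ops` is a made-up helper name.

```python
def conv_ops(height, width, k, layers=1):
    """Multiplies and adds for `layers` stacked k x k convolutions.

    Assumes one channel, no bias, and one output per input pixel.
    """
    outputs = height * width
    multiplies = outputs * k * k * layers
    adds = outputs * (k * k - 1) * layers  # summing k*k products takes k*k - 1 adds
    return multiplies, adds


print(conv_ops(10, 10, 5))            # (2500, 2400)
print(conv_ops(10, 10, 3))            # (900, 800)
print(conv_ops(10, 10, 3, layers=2))  # (1800, 1600)
```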

I was told that the reason for using the 5x5 is that even though 2 3x3s have the same receptive field, the math works out differently, and that is why larger kernels are sometimes used. This seemed fine at first, but on thinking about it more, 2 3x3s should be able to learn weights that reproduce the results of our 5x5 kernel if that proved to be optimal, while also having the advantage of learning functions the 5x5 cannot. I will spare you the math, but I proved to myself that 2 simple 1x2 layers could learn any function that 1 1x3 layer could, so I’m decently confident that this extrapolates to higher dimensions.
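One way to probe the 1x2 vs 1x3 claim numerically is sketched below. It assumes purely linear, bias-free layers (the post doesn't spell out the exact setup), and the helper names are hypothetical. Under those assumptions, stacking kernels (a, b) and (c, d) gives the effective 1x3 kernel (ac, ad+bc, bd), and since (ad+bc)^2 - 4(ac)(bd) = (ad-bc)^2, a single-channel stack can only reach targets (x, y, z) with y^2 - 4xz >= 0. With two intermediate channels the effective kernel is a sum of two such products, and that restriction goes away.

```python
import numpy as np


def compose_1x2(k1, k2):
    """Effective 1x3 kernel of two stacked linear 1x2 layers (one channel, no bias)."""
    return np.convolve(k1, k2)  # (a, b), (c, d) -> (a*c, a*d + b*c, b*d)


def reachable_single_channel(target):
    """True if the real 1x3 kernel (x, y, z) factors into two real 1x2 kernels."""
    x, y, z = target
    return y * y - 4 * x * z >= 0


print(compose_1x2([1.0, 2.0], [3.0, 4.0]))          # values 3, 10, 8
print(reachable_single_channel((3.0, 10.0, 8.0)))   # True
print(reachable_single_channel((1.0, 0.0, 1.0)))    # False

# Two intermediate channels: the effective kernel is the sum of two compositions.
two_channel = compose_1x2([1.0, 0.0], [1.0, 0.0]) + compose_1x2([0.0, 1.0], [0.0, 1.0])
print(two_channel)                                  # values 1, 0, 1
```

So whether the stacked layers can match an arbitrary larger kernel seems to depend on the number of intermediate channels (and on what the non-linearity does), not only on the receptive field.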

The leading ideas I can come up with:

- 5x5 works as a quasi regularization technique that prevents 2 3x3 convs from overfitting
- While 2 3x3s could learn to be a 5x5, it would take more training to get there than starting with a more linear function. (But that doesn’t make sense, since the 2 3x3s have 18 weights that need adjusting compared to the 25 of the 5x5. Maybe the nonlinearities make it harder to arrive at a solution for those 18 weights?)
- 2 3x3s will use more memory for activation maps, which could be saved with 1 5x5. (Then my question is: why not use dilated convolution? But since dilation throws away information, the trade-offs there are a little easier to understand.)