Kernel Size - Why would you ever make it larger?

This is a question which I’m not sure has an answer.

It is common practice for convNets to use a larger kernel in the first layer, then 3x3 kernels for the many layers that follow. The thing I don’t understand is - why?

2 stacked 3x3 conv layers have a receptive field of 5x5, need fewer arithmetic operations than a single 5x5, and add an extra non-linearity. So they should be faster and able to represent more complex functions. (We’ll ignore a 1x3 layer followed by a 3x1 layer, though the same logic should apply.)
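The receptive-field claim is easy to check. Here is a minimal sketch, assuming plain (undilated) convolutions measured along one axis; `receptive_field` is a made-up helper name, not something from a library.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field along one axis of a stack of conv layers.

    Assumes no dilation. `strides` defaults to 1 for every layer.
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)
    field = 1  # a single input pixel sees itself
    jump = 1   # spacing, in input pixels, between neighbouring outputs
    for k, s in zip(kernel_sizes, strides):
        field += (k - 1) * jump
        jump *= s
    return field


print(receptive_field([3, 3]))  # two stacked 3x3 layers
print(receptive_field([5]))     # one 5x5 layer
```

Both calls give 5, so the two options see the same 5x5 patch of the input.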

Take a 10x10 input, 100 pixels. Convolving a 5x5 over each of the 100 pixels takes 2500 multiplies and 2400 adds. Convolving a 3x3 over each of the 100 pixels takes 900 multiplies and 800 adds. Do that twice, and we have 1800 multiplies and 1600 adds.
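The counts above can be reproduced with a short sketch. It assumes a single channel, stride 1, padding so that every layer keeps the 10x10 size, and no bias term; `conv_ops` is a hypothetical helper written for this post.

```python
def conv_ops(height, width, kernel, layers=1):
    """Return (multiplies, adds) for `layers` stacked kernel x kernel convs.

    Assumes one input and one output channel, stride 1, and padding that
    keeps the output the same size as the input. Bias adds are ignored.
    """
    pixels = height * width
    multiplies_per_pixel = kernel * kernel
    adds_per_pixel = kernel * kernel - 1  # summing k*k products
    return (layers * pixels * multiplies_per_pixel,
            layers * pixels * adds_per_pixel)


print(conv_ops(10, 10, 5))            # one 5x5 layer
print(conv_ops(10, 10, 3))            # one 3x3 layer
print(conv_ops(10, 10, 3, layers=2))  # two stacked 3x3 layers
```

With real networks the channel counts multiply into these numbers, so the ratio matters more than the absolute values.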

I was told that the reason for using the 5x5 is that even though 2 3x3s have the same receptive field, the math works out differently, and that is why larger kernels are sometimes used. That sounded fine at first, but thinking about it more, 2 3x3s should be able to learn weights that reproduce our 5x5 kernel’s results if that proved to be optimal, while also having the advantage of learning functions the 5x5 cannot. I will spare you the math, but I proved to myself that 2 simple 1x2 layers could learn any function that 1 1x3 layer could, so I’m decently confident that this extrapolates to higher dimensions.
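One direction of this can be sketched in code: with no non-linearity in between, applying one kernel and then another is the same as applying a single larger kernel, the full 2D convolution of the two. The sketch below assumes a single channel and pure-Python lists; `compose` is a hypothetical helper name. Note that it only shows that two 3x3s collapse into a 5x5. The composed kernel is built from 18 weights, so this does not show that every 5x5 kernel can be reached.

```python
def compose(k1, k2):
    """Single kernel equivalent to applying k1 then k2 with no activation.

    This is the full 2D convolution of the two kernels (single channel).
    """
    h1, w1 = len(k1), len(k1[0])
    h2, w2 = len(k2), len(k2[0])
    out = [[0.0] * (w1 + w2 - 1) for _ in range(h1 + h2 - 1)]
    for i in range(h1):
        for j in range(w1):
            for p in range(h2):
                for q in range(w2):
                    out[i + p][j + q] += k1[i][j] * k2[p][q]
    return out


# Two 1x2 kernels collapse into one 1x3 kernel.
print(compose([[1, 2]], [[3, 4]]))

# Two 3x3 box filters collapse into one 5x5 kernel.
box = [[1, 1, 1] for _ in range(3)]
for row in compose(box, box):
    print(row)
```

Once an activation sits between the two layers, the collapse no longer holds, which is where the extra expressiveness comes from.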

The leading ideas I can come up with:

  1. A 5x5 acts as a quasi-regularization technique, avoiding the overfitting that 2 3x3 convs might be prone to.
  2. While 2 3x3s could learn to be a 5x5, it would take more training to get there than it would starting from a more linear function.
    (But that doesn’t make sense, since the 2 3x3s have 18 weights to adjust compared to 25 for the 5x5. Maybe the non-linearities make it harder to arrive at a solution for those 18 weights?)
  3. 2 3x3s use more memory for activation maps, which a single 5x5 would save. (Then my question is: why not use dilated convolution? But since dilation throws away information, the trade-offs there are a little easier to understand.)
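The numbers behind ideas 2 and 3 can be put side by side. This is a rough sketch for a single channel with same-size outputs; the helper names are invented for illustration.

```python
def weight_count(kernel, layers=1):
    """Weights in `layers` stacked kernel x kernel convs (one channel, no bias)."""
    return layers * kernel * kernel


def activation_values(height, width, layers=1):
    """Output values produced, assuming every layer keeps the input size."""
    return layers * height * width


def effective_kernel(kernel, dilation=1):
    """Span of input pixels covered by a dilated kernel along one axis."""
    return dilation * (kernel - 1) + 1


print(weight_count(3, layers=2), weight_count(5))  # 18 vs 25 weights
print(activation_values(10, 10, layers=2),
      activation_values(10, 10, layers=1))         # 200 vs 100 values
print(effective_kernel(3, dilation=2))             # a 3x3 with dilation 2 spans 5
```

A 3x3 with dilation 2 covers a 5x5 window using only 9 weights and one activation map, but it samples just 9 of the 25 positions, which is the information loss mentioned in idea 3.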

Great question, I’ve been wondering the same thing. Stacked 3x3 convs seem to be the standard everywhere (sometimes with a 1x1 conv in front for feature compression), but even so, you often see a larger conv in the very first layer of a network. I’ve been spending a bunch of time with DenseNet this last week and they do the same thing on the ImageNet dataset.

Table 2 in the paper shows that the very first layer is a 7x7 stride 2 conv, and I’m wondering what the conceptual rationale for this is.
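One practical effect of a stride 2 first layer is that it halves the spatial size right away. Below is a minimal sketch of the usual output-size formula. The 224-pixel input and the padding of 3 are assumptions for illustration, not values quoted from the table, and `conv_output_size` is a hypothetical helper.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one axis for a standard (undilated) convolution."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed values: 224-pixel input, padding 3 for the 7x7 stride 2 conv.
print(conv_output_size(224, kernel=7, stride=2, padding=3))

# For comparison: a 3x3 stride 1 conv with padding 1 keeps the size.
print(conv_output_size(224, kernel=3, stride=1, padding=1))
```

Under those assumptions the 7x7 stride 2 layer produces a 112x112 map, so every later layer works on a quarter of the pixels.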

I think part of the answer is simply that AlexNet did it. The initial layer of that network used an 11 x 11 kernel.

The only paper I can remember that directly contrasted multiple 3 x 3 kernels with a single 5 x 5 or larger kernel came down in favor of multiple 3 x 3 kernels, for exactly the reason mentioned above. (The exact paper eludes me right now - I think it was one from Google.)

So I don’t think there is a really good reason to do it this way (although if someone else can think of a reason I’m all ears).

That said, there is a less compelling reason to use a larger initial kernel size - the first layer filters are much more visually appealing when you include them as a figure in your paper.
