While implementing XResNet from the Bag of Tricks for Image Classification paper, we replace the 7x7 conv in the input stem with three 3x3 convs. It is said that those three stacked 3x3 convs have a receptive field of 7x7.

As I understand it, the receptive field is the region of the input image that a conv layer's output depends on. The first 3x3 conv has a receptive field of 3x3 because its filter size is 3x3. The second conv layer then convolves over the output of the first, which is based on a 3x3 receptive field. Similarly, the third conv layer convolves over the output of the second, which is in turn based on the first layer with its 3x3 receptive field. So how do we get a receptive field of 7x7?

The 3x3 convs were used because of the computational expense of the 7x7 conv. Can you point to the line that says the 3x3 convs have a receptive field of 7x7?

A 3x3 convolution is like a weighted average over the 3x3 neighborhood of each pixel. So a 3x3 applied after a 3x3 looks at the neighborhood of each of the neighboring pixels, which makes it equivalent to a 5x5 in terms of the input area covered. Could that be it?

That’s exactly the reason. (@liuyao) Just trace which pixels in the original input image will affect a single pixel in the third layer.
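For anyone who would rather check the tracing with a few lines of code than on paper, here is a minimal sketch. The function name `receptive_field` and its arguments are made up for illustration, and it assumes stride 1 unless told otherwise (a stride greater than 1 in an earlier layer makes the receptive field of later layers grow faster).

```python
def receptive_field(kernel_sizes, strides=None):
    """Side length (in input pixels) of the receptive field of a conv stack.

    Hypothetical helper for illustration; assumes square kernels and no dilation.
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)  # assumption: stride 1 everywhere

    rf = 1    # a single input pixel only "sees" itself
    jump = 1  # spacing, in input pixels, between neighboring units of the current layer
    for k, s in zip(kernel_sizes, strides):
        # each extra tap of the kernel reaches (jump) input pixels further out
        rf += (k - 1) * jump
        jump *= s
    return rf


for n in range(1, 4):
    print(n, "stacked 3x3 convs ->", receptive_field([3] * n))
```

Each additional 3x3 conv at stride 1 adds one pixel on every side, which gives 3, then 5, then 7.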

Finally understood it after tracing the pixels on paper; it was a bit hard to visualize the neighboring pixels in my head.

Another way to think of it: if you pass a 5x5 image through 2 stacked 3x3 convs (stride 1, no padding), you get a 1x1 output feature map, so the receptive field of the second conv layer is 5x5. Similarly, for a 7x7 input image and 3 stacked 3x3 convs, you get a 1x1 output feature map.
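The shrinking-feature-map argument can be sketched the same way. This is only an illustration: `output_size` is a hypothetical helper, and it assumes stride 1 and no padding, so each conv with kernel size k reduces the side length by k - 1.

```python
def output_size(input_size, kernel_sizes):
    """Side length of the feature map after a stack of unpadded, stride-1 convs.

    Hypothetical helper for illustration only.
    """
    size = input_size
    for k in kernel_sizes:
        size = size - k + 1  # a k x k kernel fits (size - k + 1) times along each side
    return size


print(output_size(5, [3, 3]))     # 5x5 input, two 3x3 convs
print(output_size(7, [3, 3, 3]))  # 7x7 input, three 3x3 convs
```

Both calls give 1, meaning the whole 5x5 (or 7x7) input collapses into a single output pixel, so that pixel must depend on every input pixel.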

Thanks guys