What is the difference between 2d vs 3d convolutions?

I was trying to understand the definition of 2D convolutions vs. 3D convolutions. The "simplest definition" according to PyTorch seems to be the following:

  • 2d convolutions map (N,C_in,H,W) -> (N,C_out,H_out,W_out)
  • 3d convolutions map (N,C_in,D,H,W) -> (N,C_out,D_out,H_out,W_out)

That makes sense to me. However, what I find confusing is that I would have expected images to be considered 3D tensors, yet we apply 2D convolutions to them. Why is that? Why is the channel axis not part of the "dimensionality" of the image?
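To make the shape mapping concrete, here is a minimal sketch using PyTorch's `nn.Conv2d` (the channel counts and spatial sizes are arbitrary examples, not anything specific to this question). Note that the kernel still has all the input channels in its weight tensor; it just does not slide along that axis:

```python
import torch
import torch.nn as nn

# A "2D" convolution still consumes all input channels at once: the
# weight tensor has shape (out_channels, in_channels, kH, kW), but the
# kernel only slides along H and W -- hence the name Conv2d.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)  # (N, C_in, H, W): one RGB image
y = conv(x)

print(y.shape)           # torch.Size([1, 8, 32, 32])
print(conv.weight.shape) # torch.Size([8, 3, 3, 3])
```

So the channel axis is reduced in a single step per output location rather than being swept over, which is why it does not count toward the "2D" in the name.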


One way to think about it is via the "movement" of the convolutional filter. In an image, the conv filter is moved horizontally and vertically (so in x and y) across the image, i.e., in 2 dimensions, hence Conv2d, no matter whether it is a greyscale image (1 channel), a color image (3 channels), or a medical or satellite image with 4 or more channels. The only thing that changes is that the Conv2d filter needs a matching number of input channels in its third dimension.

In a 3D conv, the third dimension is the depth, and the conv filter is moved along that dimension too: for example, a 3x3x3 filter is moved in x, y, and z across the volume. The input in that case has more than 3 dimensions, for example x, y, z, and reflectivity for some lidar data.
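The volumetric case above can be sketched the same way with `nn.Conv3d` (shapes again chosen arbitrarily). The kernel now slides along D, H, and W, while channels remain a separate, non-sliding axis exactly as in the 2D case:

```python
import torch
import torch.nn as nn

# Conv3d slides the kernel along D, H, and W. The channel axis is still
# consumed in one step, just as with Conv2d.
conv3d = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3, padding=1)

vol = torch.randn(2, 1, 16, 32, 32)  # (N, C_in, D, H, W): e.g. two volumes
out = conv3d(vol)

print(out.shape)  # torch.Size([2, 4, 16, 32, 32])
```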


That is a great explanation!

Thank you!

I am curious: are there any successful vision models that use 3D convolutions (or 1D)?

I posted a question: Are there any successful vision models that use 3d convolutions?