I was trying to understand the definitions of 2D vs. 3D convolutions. I looked up the “simplest definition” in the PyTorch documentation, and it seems to be the following:
- 2D convolutions map
(N,C_in,H,W) -> (N,C_out,H_out,W_out)
- 3D convolutions map
(N,C_in,D,H,W) -> (N,C_out,D_out,H_out,W_out)
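To make the two mappings concrete, here is a minimal dependency-free sketch that computes the output shapes. It uses the per-dimension output-size formula given in the PyTorch `nn.Conv2d`/`nn.Conv3d` docs (assuming dilation 1 by default, and the same kernel size, stride and padding in every sliding dimension). The helper names `conv_out_size`, `conv2d_shape` and `conv3d_shape` are hypothetical and not part of PyTorch. Notice that C_in never enters the spatial formula: only H, W (and D for 3D) are slid over.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    # Output length along ONE sliding dimension, per the PyTorch conv docs:
    # floor((size + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def conv2d_shape(n, c_in, h, w, c_out, kernel, stride=1, padding=0):
    # Hypothetical helper: (N, C_in, H, W) -> (N, C_out, H_out, W_out).
    # c_in is consumed entirely by each kernel, so it does not affect the output shape.
    return (n, c_out,
            conv_out_size(h, kernel, stride, padding),
            conv_out_size(w, kernel, stride, padding))

def conv3d_shape(n, c_in, d, h, w, c_out, kernel, stride=1, padding=0):
    # Hypothetical helper: (N, C_in, D, H, W) -> (N, C_out, D_out, H_out, W_out).
    # Same idea, but the kernel also slides along D.
    return (n, c_out,
            conv_out_size(d, kernel, stride, padding),
            conv_out_size(h, kernel, stride, padding),
            conv_out_size(w, kernel, stride, padding))

# 8 RGB images of 32x32, 16 filters of size 3x3
print(conv2d_shape(8, 3, 32, 32, c_out=16, kernel=3))      # (8, 16, 30, 30)
# 8 RGB clips of 10 frames at 32x32, 16 filters of size 3x3x3
print(conv3d_shape(8, 3, 10, 32, 32, c_out=16, kernel=3))  # (8, 16, 8, 30, 30)
```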
This makes sense to me. What I find confusing, however, is that I would have expected images to be treated as 3D tensors (C_in, H, W), yet we apply 2D convolutions to them. Why is that? Why isn't the channel dimension counted as part of the “dimensionality” of the image?
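One way to see where the channel dimension goes is a naive 2D convolution written out by hand. This is a simplified sketch assuming a single image, no padding, stride 1, no bias and groups=1; the function name `conv2d_naive` is hypothetical. It mirrors the weight layout PyTorch documents for `nn.Conv2d` with groups=1, namely (C_out, C_in, kH, kW): each kernel spans all input channels at once, so the window slides only over H and W, while the channel index is summed over in full rather than slid along.

```python
def conv2d_naive(x, weight):
    # x: nested lists shaped [C_in][H][W] (one image, no batch dimension)
    # weight: nested lists shaped [C_out][C_in][kH][kW]
    # Assumes no padding, stride 1, no bias. Returns [C_out][H_out][W_out].
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    c_out = len(weight)
    kh, kw = len(weight[0][0]), len(weight[0][0][0])
    assert len(weight[0]) == c_in, "each kernel must span every input channel"
    h_out, w_out = h - kh + 1, w - kw + 1
    out = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for o in range(c_out):
        for i in range(h_out):          # the window slides over H ...
            for j in range(w_out):      # ... and over W, nothing else
                acc = 0.0
                # The channel loop is a full sum, not a slide: every input
                # channel contributes to every output position.
                for c in range(c_in):
                    for di in range(kh):
                        for dj in range(kw):
                            acc += x[c][i + di][j + dj] * weight[o][c][di][dj]
                out[o][i][j] = acc
    return out

# 3-channel 4x4 image where channel c is filled with the value c + 1
x = [[[float(c + 1)] * 4 for _ in range(4)] for c in range(3)]
# one output channel, 2x2 kernel of ones on each of the 3 input channels
weight = [[[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]]
out = conv2d_naive(x, weight)
print(len(out), len(out[0]), len(out[0][0]))  # 1 3 3
print(out[0][0][0])                           # 24.0 = 4*1 + 4*2 + 4*3
```

In this picture the "2D" names the number of dimensions the kernel moves along, not the rank of the input tensor. A 3D convolution adds a third sliding dimension (D), and its weight correspondingly gains a kD axis: (C_out, C_in, kD, kH, kW).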