related post: https://ai.stackexchange.com/questions/13975/is-there-any-use-of-using-3d-convolutions-for-traditional-images-like-cifar10
I am curious whether it is of any use on images like CIFAR-10…
My own thoughts:
I am curious whether there is any advantage to using 3D convolutions on images like CIFAR-10/100 or ImageNet. I know that they are not usually used on these datasets, though they could be, because the channel dimension could be treated as the “depth” dimension.
I know that there are only 3 channels, but let's think more deeply. 3D convolutions could be used deeper in the architecture, even though the input image has only 3 channels. At any point in the depth of the network we could have a feature map of shape (C_F,H,W), where C_F is dictated by the number of filters in the previous layer, and then apply a 3D convolution whose kernel size along the depth dimension is less than C_F.
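To make the setup concrete, here is a minimal sketch in plain Python that only does the shape and parameter arithmetic for the two options: a standard 2D convolution over a (C_F,H,W) feature map, and a 3D convolution that treats the same map as a single-channel volume of shape (1,C_F,H,W). The function names and the example sizes (C_F=64, H=W=32, 3x3 kernels, 8 filters) are my own assumptions for illustration, and it assumes stride 1, no padding, and no dilation.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_summary(c_in, h, w, c_out, k):
    """Standard 2D conv: every filter spans ALL c_in channels at once."""
    out_shape = (c_out, conv_out(h, k), conv_out(w, k))
    params = c_out * (c_in * k * k + 1)  # +1 is the bias of each filter
    return out_shape, params


def conv3d_over_channels_summary(c_f, h, w, n_filters, k_depth, k):
    """3D conv on a (C_F, H, W) map viewed as a volume (1, C_F, H, W).

    Each filter covers only k_depth neighbouring channels and slides
    along the channel axis as well as along H and W.
    """
    out_shape = (
        n_filters,
        conv_out(c_f, k_depth),
        conv_out(h, k),
        conv_out(w, k),
    )
    params = n_filters * (1 * k_depth * k * k + 1)  # one input "channel"
    return out_shape, params


# Hypothetical sizes, chosen only for illustration.
shape_2d, params_2d = conv2d_summary(c_in=64, h=32, w=32, c_out=64, k=3)
shape_3d, params_3d = conv3d_over_channels_summary(
    c_f=64, h=32, w=32, n_filters=8, k_depth=3, k=3
)

print(shape_2d, params_2d)  # (64, 30, 30) 36928
print(shape_3d, params_3d)  # (8, 62, 30, 30) 224
```

The 3D variant shares its weights along the channel axis, so it has far fewer parameters but produces an extra output axis. Whether that weight sharing is useful depends on whether neighbouring channels are related in any meaningful way, which is exactly what I am asking about.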
Is there any point in doing that? When is this helpful? When is it not helpful?
I am assuming (though I have no mathematical proof or empirical evidence) that if the first layer aggregates all input pixels/activations and disregards locality (like a fully connected layer, or a conv2D that simply aggregates all the values along the depth dimension of the feature map), then 3D convolutions wouldn’t do much, because the earlier layers have already destroyed the locality structure in that dimension. This sounds plausible, but I lack any evidence or theory to support it. I know deep learning relies on empirical evidence to support its claims, so perhaps there is something that confirms my intuition?
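A toy illustration of this intuition (not a proof), in plain Python: it looks at the activations of four channels at a single pixel. The helper names, values, and permutation are all made up for the example. A layer that mixes all channels, like a 1x1 conv2D, gives the same output if the channels are reordered and its weights are reordered to match, so the channel order carries no meaning for it. A small kernel sliding along the channel axis, like the depth axis of a 3D convolution, does depend on which channels happen to be neighbours.

```python
def mix_all_channels(x, weights):
    """Like a 1x1 conv2D at one pixel: each output sums over every channel."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]


def slide_over_channels(x, kernel):
    """Like the depth axis of a 3D conv: a small kernel slides along channels."""
    k = len(kernel)
    return [
        sum(kernel[j] * x[i + j] for j in range(k))
        for i in range(len(x) - k + 1)
    ]


# Hypothetical activations of 4 channels at one pixel.
x = [1.0, 2.0, 4.0, 8.0]
weights = [[1.0, 0.0, -1.0, 2.0],
           [0.5, 1.0, 0.0, -1.0]]
kernel = [1.0, -1.0]

# Reorder the channels, and reorder the mixing weights the same way.
perm = [2, 0, 3, 1]
x_perm = [x[p] for p in perm]
weights_perm = [[row[p] for p in perm] for row in weights]

# Full mixing does not care about channel order.
print(mix_all_channels(x, weights))            # [13.0, -5.5]
print(mix_all_channels(x_perm, weights_perm))  # [13.0, -5.5]

# A kernel sliding along the channel axis does.
print(slide_over_channels(x, kernel))       # [-1.0, -2.0, -4.0]
print(slide_over_channels(x_perm, kernel))  # [3.0, -7.0, 6.0]
```

This only shows that the channel ordering produced by an ordinary conv2D is arbitrary from that layer's point of view. It does not show whether training could still arrange the channels so that a 3D convolution on top of them becomes useful.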