Are there any successful vision models that use 3d convolutions?

Are there any successful vision models that use 3D convolutions? I notice that it's mostly 2D. Is there anything that works well with 3D convolutions and is state of the art on any of the standard datasets, e.g.

CIFAR-10
ImageNet
MNIST?


Question inspired by:

3D CNNs are used at least for volumetric data and hyperspectral image classification. There is no advantage to using 3D convolutions for ImageNet or CIFAR-10, because those images only have 3 channels. You can think of it this way: both approaches find interesting features, but while 2D CNNs find them in only two dimensions (the x and y axes), 3D filters also account for the z-axis (time, volume, or the spectral dimension).
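For concreteness, here is a minimal PyTorch sketch of the shape difference (the sizes are just illustrative): a 2D conv kernel always spans the full channel depth of its input, while a 3D conv kernel also slides along an extra depth/time/spectral axis, so locality along that axis is preserved.

```python
import torch
import torch.nn as nn

# 2D conv: input is (N, C, H, W); the kernel spans the full channel depth.
x2d = torch.randn(1, 3, 32, 32)              # e.g. one CIFAR-10 image
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(conv2d(x2d).shape)                     # torch.Size([1, 16, 32, 32])

# 3D conv: input is (N, C, D, H, W); the kernel also slides along the extra
# D axis (time, volume slices, or spectral bands), keeping locality there.
x3d = torch.randn(1, 1, 16, 64, 64)          # e.g. 16 slices of a volume
conv3d = nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
print(conv3d(x3d).shape)                     # torch.Size([1, 16, 16, 64, 64])
```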

Some examples where 3D CNNs are used:

V-Net for volumetric image segmentation https://arxiv.org/pdf/1606.04797.pdf
Smoke detection on Video Sequences https://link.springer.com/article/10.1007/s10694-019-00832-w
Spectral-spatial classification of Hyperspectral Imagery: https://www.mdpi.com/2072-4292/9/1/67


related post: https://ai.stackexchange.com/questions/13975/is-there-any-use-of-using-3d-convolutions-for-traditional-images-like-cifar10

I am curious if it's of any use on images like CIFAR-10…


My own thoughts:

I am curious whether there is any advantage to using 3D convolutions on images like CIFAR-10/100 or ImageNet. I know that they are not usually used on these datasets, though they could be, because the channel dimension could be used as the "depth" dimension.

I know that the input only has 3 channels, but let's think more deeply. 3D convolutions could be used deeper in the architecture despite the input image only having 3 channels. At any point in the depth of the network we have a feature map of shape (C_F, H, W), where C_F is dictated by the number of filters, and we could apply a 3D convolution whose kernel size in the depth dimension is less than C_F.
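To make the idea concrete, here is a rough PyTorch sketch of what that could look like (the shapes, layer sizes, and the final reshape are just my illustrative choices, not something from a published architecture):

```python
import torch
import torch.nn as nn

# Hypothetical intermediate feature map (N, C_F, H, W), e.g. after some 2D conv layers.
feat = torch.randn(8, 64, 16, 16)

# Treat the channel axis as the "depth" axis: add a singleton channel dim so the
# tensor becomes (N, 1, C_F, H, W), then convolve with a 3D kernel whose depth
# extent (here 5) is smaller than C_F, so it only mixes neighbouring feature
# channels instead of aggregating all of them at once.
conv3d = nn.Conv3d(in_channels=1, out_channels=8,
                   kernel_size=(5, 3, 3), padding=(2, 1, 1))
out = conv3d(feat.unsqueeze(1))   # (8, 8, 64, 16, 16)

# To keep feeding 2D layers afterwards, one option is to merge the new channel
# and depth axes back together.
out2d = out.flatten(1, 2)
print(out2d.shape)                # torch.Size([8, 512, 16, 16])
```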

Is there any point in doing that? When is this helpful? When is it not helpful?

I am assuming (though I have no mathematical proof or empirical evidence) that if an early layer aggregates all input pixels/activations and disregards locality (like a fully connected layer, or a 2D conv that simply aggregates all the values along the depth of the feature space), then 3D convs wouldn't do much, because the earlier layers already destroyed the locality structure in that dimension. It sounds plausible, but I lack any evidence or theory to support it. I know deep learning relies on empirical evidence to support its claims, so perhaps there is something that confirms my intuition?

Any ideas?


I had the same question. After some digging, I found a similar concept called "depthwise separable convolutions".

It's a way to replicate a 2D conv (whose filter is really 3D, spanning the full input volume) with fewer parameters and less computation. Here's a good blog post with more info & diagrams.

Side note: depthwise separable convolutions use multiple 2D filters. I'm also wondering if people do this with a single 2D filter.
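For reference, here is a minimal PyTorch sketch of a depthwise separable conv block (the class name and sizes are just illustrative): the depthwise part applies one spatial filter per input channel via groups=in_ch, and the pointwise 1×1 conv then mixes the channels.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one 2D filter per input channel) followed by a
    1x1 pointwise conv that recombines the channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # Depthwise: groups=in_ch gives each input channel its own spatial filter.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        # Pointwise: 1x1 conv mixes channels, like the depth part of a full filter.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 28, 28)
block = DepthwiseSeparableConv(32, 64)
print(block(x).shape)   # torch.Size([1, 64, 28, 28])
```

For the 32→64, 3×3 case above, that is 32·9 + 32·64 = 2,336 weights (ignoring biases) versus 32·64·9 = 18,432 for a standard conv.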

You may check these two Kaggle competitions. There are also some specific 3D datasets available on the Kaggle site.
link 1 and link 2