Are there any successful vision models that use 3D convolutions?

3D CNNs are used at least for volumetric data, video, and hyperspectral image classification. There is little to gain from 3D convolutions on ImageNet or CIFAR-10, because those images have only 3 channels (RGB) and a 2D convolution already spans all of them. You can think of it this way: both approaches find interesting features, but while 2D CNNs look for them along only two dimensions (the x and y axes), 3D filters also account for the z-axis (time, volume, or spectral dimension).
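To make the difference concrete, here is a rough sketch in plain Python that only computes output shapes for an unpadded, stride-1 convolution. The helper names (conv_out_size, conv2d_out_shape, conv3d_out_shape) and the example sizes are made up for illustration and are not part of any library.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Output length along one axis for a standard (non-dilated) convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_out_shape(in_shape, out_channels, kernel):
    """in_shape = (channels, height, width).

    A 2D kernel spans all input channels at once, so it only slides
    along height and width.
    """
    _, h, w = in_shape
    return (out_channels, conv_out_size(h, kernel), conv_out_size(w, kernel))


def conv3d_out_shape(in_shape, out_channels, kernel):
    """in_shape = (channels, depth, height, width).

    A 3D kernel additionally slides along the depth axis
    (time, volume slices, or spectral bands).
    """
    _, d, h, w = in_shape
    return (out_channels,) + tuple(conv_out_size(s, kernel) for s in (d, h, w))


# CIFAR-10 sized RGB image with an ordinary 2D convolution.
rgb_2d = conv2d_out_shape((3, 32, 32), out_channels=8, kernel=3)

# Same image treated as a 1-channel "volume" of depth 3: a depth-3 kernel
# fits exactly once, so there is nothing left to slide over.
rgb_3d = conv3d_out_shape((1, 3, 32, 32), out_channels=8, kernel=3)

# Hypothetical 16-frame RGB video clip: now the depth axis is long enough
# for the kernel to pick up patterns across neighbouring frames.
clip_3d = conv3d_out_shape((3, 16, 32, 32), out_channels=8, kernel=3)

print(rgb_2d)   # (8, 30, 30)
print(rgb_3d)   # (8, 1, 30, 30)
print(clip_3d)  # (8, 14, 30, 30)
```

With only 3 channels the 3D output collapses to depth 1, which is essentially what the 2D convolution gives you anyway. The extra axis only pays off when there are many frames, slices, or bands to slide along.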

Some examples where 3D CNNs are used:

V-Net for volumetric image segmentation: https://arxiv.org/pdf/1606.04797.pdf
Smoke detection on video sequences: https://link.springer.com/article/10.1007/s10694-019-00832-w
Spectral-spatial classification of hyperspectral imagery: https://www.mdpi.com/2072-4292/9/1/67
