One way to think about it is via the "movement" of the convolutional filter. In an image, the conv filter is moved horizontally and vertically (in x and y) across the image, that is, in 2 dimensions, hence Conv2D. This holds no matter whether it is a greyscale image (1 channel), a color image (3 channels), or a medical or satellite image with 4 or more channels. The only thing that changes is that the Conv2D filter needs a matching number of input channels in its third dimension.
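To make the "movement" concrete, here is a minimal pure-Python sketch of a single 2D filter applied to a multi-channel image. It is only an illustration, not how any framework implements it: the function name `conv2d_single_filter`, the helper `filled`, and the `[channels][height][width]` layout are assumptions made for this example, and it uses no padding and a stride of 1.

```python
def conv2d_single_filter(image, kernel):
    """Slide one filter over a multi-channel image (no padding, stride 1).

    image:  nested lists shaped [channels][height][width]   (assumed layout)
    kernel: nested lists shaped [channels][kh][kw]
    """
    channels = len(image)
    # The filter's channel count must match the image's channel count.
    assert len(kernel) == channels, "filter in-channels must match image channels"
    height, width = len(image[0]), len(image[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])

    out = []
    # The filter moves along y and x only: two loops, hence "2D".
    for y in range(height - kh + 1):
        row = []
        for x in range(width - kw + 1):
            total = 0.0
            # At each position it covers ALL channels at once; it never
            # "moves" along the channel axis.
            for c in range(channels):
                for dy in range(kh):
                    for dx in range(kw):
                        total += image[c][y + dy][x + dx] * kernel[c][dy][dx]
            row.append(total)
        out.append(row)
    return out


def filled(shape, value=1.0):
    """Hypothetical helper: nested lists of the given shape filled with value."""
    if len(shape) == 1:
        return [value] * shape[0]
    return [filled(shape[1:], value) for _ in range(shape[0])]


# Greyscale (1), color (3) and a 4-channel image all give a 2D output map.
for channels in (1, 3, 4):
    out = conv2d_single_filter(filled((channels, 5, 5)), filled((channels, 3, 3)))
    print(channels, "channels -> output", len(out), "x", len(out[0]))
```

Whatever the channel count, the output of one filter is a single 3 x 3 map for a 5 x 5 input here; only the filter's depth changes with the number of channels.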
In a 3D conv, the third dimension is the depth, and the conv filter is moved along that dimension too; for example, a 3x3x3 filter is moved in x, y and z across the volume. The input in that case has more than 3 dimensions: the three spatial ones plus a channel dimension, for example x, y, z and reflectivity for some lidar data.
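The 3D case can be sketched the same way in plain Python. Again this is just an illustration under assumptions: the name `conv3d_single_filter`, the helper `filled`, and the `[channels][depth][height][width]` layout are made up for this example, and the lidar data is assumed to be a voxel grid with reflectivity as its single channel (no padding, stride 1).

```python
def conv3d_single_filter(volume, kernel):
    """Slide one filter over a multi-channel volume (no padding, stride 1).

    volume: nested lists shaped [channels][depth][height][width]  (assumed layout)
    kernel: nested lists shaped [channels][kd][kh][kw]
    """
    channels = len(volume)
    assert len(kernel) == channels, "filter in-channels must match volume channels"
    depth, height, width = len(volume[0]), len(volume[0][0]), len(volume[0][0][0])
    kd, kh, kw = len(kernel[0]), len(kernel[0][0]), len(kernel[0][0][0])

    out = []
    # The filter now moves along z, y and x: three loops, hence "3D".
    for z in range(depth - kd + 1):
        plane = []
        for y in range(height - kh + 1):
            row = []
            for x in range(width - kw + 1):
                total = 0.0
                # As in the 2D case, all channels are summed at each position.
                for c in range(channels):
                    for dz in range(kd):
                        for dy in range(kh):
                            for dx in range(kw):
                                total += (volume[c][z + dz][y + dy][x + dx]
                                          * kernel[c][dz][dy][dx])
                row.append(total)
            plane.append(row)
        out.append(plane)
    return out


def filled(shape, value=1.0):
    """Hypothetical helper: nested lists of the given shape filled with value."""
    if len(shape) == 1:
        return [value] * shape[0]
    return [filled(shape[1:], value) for _ in range(shape[0])]


# One channel (e.g. reflectivity) on a 4 x 5 x 6 voxel grid, one 3x3x3 filter.
volume = filled((1, 4, 5, 6))
kernel = filled((1, 3, 3, 3))
out = conv3d_single_filter(volume, kernel)
print("output volume:", len(out), "x", len(out[0]), "x", len(out[0][0]))
```

The output of one filter is itself a volume (2 x 3 x 4 here), because the filter had three directions to move in.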