I'm currently working on an article that gives an overview of the different kinds of convolutions we have today. For this purpose I want to make sure that I have fully understood depthwise separable convolutions.
A regular 3x3 convolution with 16 input channels and 32 output channels does the following: each of the 16 input channels is traversed by 32 3x3 kernels, for a total of 4608 (16x32x3x3) parameters. We now have 32 different feature maps for each of the 16 channels. Next we take one feature map from each of the 16 input channels and add them together. Since we can do that 32 times, we get the 32 output channels we wanted.
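To sanity-check that count, here is a minimal sketch in plain Python. The function name `conv_params` is made up for illustration, and bias terms are assumed to be absent:

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (assumption: no bias terms)."""
    # One k x k kernel for every (input channel, output channel) pair.
    return c_in * c_out * k * k


print(conv_params(16, 32, 3))  # 4608
```

With a bias per output channel, the count would grow by `c_out`.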
For a depthwise separable convolution on the same setup, we traverse each of the 16 channels with a single 3x3 kernel, resulting in 16 feature maps. Each of these feature maps is then traversed by 32 1x1 convolutions, resulting in 512 (16x32) feature maps. Now we take one feature map from each of the 16 input channels and add them up. Since we can do that 32 times, we get the 32 output channels we wanted. The total number of parameters is 16x3x3 + 16x32x1x1 = 656.
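The same arithmetic as a small plain-Python sketch, in case it helps to check my numbers. `depthwise_separable_params` is a hypothetical helper name, and bias terms are again assumed to be absent:

```python
def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a depthwise separable convolution (assumption: no bias terms)."""
    depthwise = c_in * k * k   # one k x k kernel per input channel
    pointwise = c_in * c_out   # c_out filters, each of shape 1 x 1 x c_in
    return depthwise + pointwise


separable = depthwise_separable_params(16, 32, 3)
standard = 16 * 32 * 3 * 3  # regular 3x3 convolution on the same setup

print(separable)                       # 656
print(round(separable / standard, 3))  # 0.142
```

So on this setup the separable version needs roughly one seventh of the weights of the regular convolution.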
Does that sound about right?