How depthwise separable convolutions work

Hey everybody,

I’m currently working on an article that gives an overview of the different kinds of convolutions we have today. For this purpose I want to make sure that I have fully understood depthwise separable convolutions.

A regular 3x3 convolution with 16 input channels and 32 output channels does the following: each of the 16 input channels is traversed by 32 3x3 kernels, for a total of 4608 (16x32x3x3) parameters. We now have 32 different feature maps for each of the 16 channels. Next, we take one feature map from each of the 16 input channels and add them together. Since we can do that 32 times, we get the 32 output channels we wanted.

For a depthwise separable convolution on the same setup, we traverse each of the 16 channels with one 3x3 kernel, resulting in 16 feature maps. Each of these feature maps is then traversed by 32 1x1 convolutions, resulting in 512 (16x32) feature maps. Now we take one feature map from each of the 16 input channels and add them up. Since we can do that 32 times, we get the 32 output channels we wanted. The total number of parameters is 16x3x3 + 16x32x1x1 = 656.
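
As a quick sanity check on the two counts, here is a small sketch in plain Python. The function names are made up for illustration, and bias terms are left out, as in the numbers above.

```python
def regular_conv_params(c_in, c_out, k):
    # One k x k kernel for every (input channel, output channel) pair.
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    depthwise = c_in * k * k      # one k x k kernel per input channel
    pointwise = c_in * c_out      # c_out kernels of size 1 x 1 x c_in
    return depthwise + pointwise


regular = regular_conv_params(16, 32, 3)
separable = depthwise_separable_params(16, 32, 3)
print(regular)    # 4608
print(separable)  # 656
```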

Does that sound about right?

That looks right to me, though I think about it a little differently.

For a regular 3x3 convolution with 16 input channels and 32 output channels, I think of it as a bunch (32) of 3x3x16 kernels, each producing one output channel.

For a “depthwise convolution,” I think of it as a single 3x3x16 kernel, but whereas we normally sum across the input channels, here we keep them separate. So that one kernel will produce (preserve) 16 output channels.

A “pointwise convolution” is a regular convolution where each kernel is size 1x1x16.

And a “depthwise separable convolution” is a depthwise convolution followed by a pointwise convolution.
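
To make that concrete, here is a minimal NumPy sketch of the two steps. It is only an illustration under some assumptions of mine: stride 1, no padding, no bias, a single 8x8 input with 16 channels, and random weights. The function names are hypothetical and not taken from any library.

```python
import numpy as np


def depthwise_conv(x, dw):
    """x: (C, H, W), dw: (C, k, k). Each channel gets its own k x k filter;
    nothing is summed across channels, so C channels go in and C come out."""
    c, h, w = x.shape
    k = dw.shape[-1]
    out_h, out_w = h - k + 1, w - k + 1
    out = np.zeros((c, out_h, out_w))
    for i in range(k):
        for j in range(k):
            # Weight (i, j) of every channel's filter times the shifted input window.
            out += dw[:, i, j][:, None, None] * x[:, i:i + out_h, j:j + out_w]
    return out


def pointwise_conv(x, pw):
    """x: (C_in, H, W), pw: (C_out, C_in). A 1x1 convolution: at every pixel,
    each output channel is a weighted sum over the input channels."""
    return np.einsum("oc,chw->ohw", pw, x)


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))    # 16 input channels
dw = rng.standard_normal((16, 3, 3))   # the single 3x3x16 depthwise kernel
pw = rng.standard_normal((32, 16))     # 32 pointwise kernels of size 1x1x16

mid = depthwise_conv(x, dw)            # channels preserved: shape (16, 6, 6)
out = pointwise_conv(mid, pw)          # channels mixed: shape (32, 6, 6)
print(mid.shape, out.shape, dw.size + pw.size)
```

The summing across input channels only happens in the pointwise step, which is why the depthwise step alone keeps the 16 channels separate.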
