DenseNets - Regarding how they concat features

I am reading the paper at https://arxiv.org/abs/1608.06993

It says:

"Crucially, in contrast to ResNets, we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them."

I am trying to visualize in my head how that would work. It is easy to draw a simple diagram, as they do at the beginning of the paper, but I wanted a bit more detail, such as the sizes of the matrices.

Normally we would have this:

input = 10x10 black-and-white image (1 channel, which I will ignore for simplicity)
layer1 filter = 2x2 stride 1 (no zero padding)
output first layer = 9x9

layer2 filter = 2x2 stride 1 (no zero padding)
output 2nd layer = 8x8
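
The sizes above follow from the usual output-size formula for a convolution, floor((n + 2p - f) / s) + 1, with padding p = 0 here. A small sketch to check the arithmetic; `conv_output_size` is a helper name made up for this post, not a function from any library:

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial size of a convolution output along one axis (hypothetical helper)."""
    return (n + 2 * padding - f) // stride + 1


size = 10  # 10x10 input
sizes = [size]
for _ in range(2):  # two layers, each with a 2x2 filter, stride 1, no padding
    size = conv_output_size(size, f=2, stride=1)
    sizes.append(size)

print(sizes)  # [10, 9, 8]
```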

In the case of DenseNet, the first layer would be the same, i.e.

input = 10x10 black-and-white image (1 channel, which I will ignore for simplicity)
layer1 filter = 2x2 stride 1 (no zero padding)
output first layer = 9x9

But for the 2nd layer, would we have a filter that is 4x2 with stride 1? They say concat, so I am concatenating the filters horizontally. Maybe they concatenate vertically?

After reading more, I understood how they concatenate:

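Dense layer 1 with input 10x10x1
Dense layer 2 with input 10x10x2 (10x10 from the previous input, 10x10 from the previous output, stacked along the channel axis)

Dense layer L with input 10x10xL, where L is the index of the dense layer (assuming each layer outputs a single 10x10 feature map)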

Then there is convolution
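
To make the shapes concrete: as far as I can tell from the paper, the concatenation is along the channel (feature-map) axis, so the spatial size stays the same and only the depth grows. That only works if every layer in a block keeps the spatial size, for example through zero padding. The sketch below tracks shapes only and does no actual convolution. `dense_block_shapes`, `k0` and `growth_rate` are names picked for illustration (the paper calls the growth rate k), and the 10x10 input with a growth rate of 1 is an assumption chosen to match the toy example in this post.

```python
def dense_block_shapes(height, width, k0, growth_rate, num_layers):
    """Return the (H, W, C) input shape seen by each layer in one dense block.

    Assumes every layer preserves H and W (e.g. a padded convolution) and
    emits `growth_rate` new feature maps, which are concatenated on the
    channel axis with everything that came before.
    """
    shapes = []
    channels = k0
    for _ in range(num_layers):
        shapes.append((height, width, channels))
        channels += growth_rate  # concat: channels add up, H and W do not
    return shapes


shapes = dense_block_shapes(10, 10, k0=1, growth_rate=1, num_layers=3)
for i, shape in enumerate(shapes, start=1):
    print(f"layer {i} input: {shape}")
```

Under these assumptions, the filter in the 2nd layer keeps the same spatial size as in the 1st layer; it only becomes deeper, because it has to span 2 input channels instead of 1.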
