I am reading the paper at https://arxiv.org/abs/1608.06993
It says:
"Crucially, in contrast to ResNets, we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them."
I am trying to visualize in my head how that would work. It is easy to draw a simple diagram, as they do at the beginning of the paper, but I want something more concrete, such as the sizes of the matrices involved.
Normally we would have this:
input = 10x10 black and white image, so 1 channel, which I will ignore for simplicity
layer1 filter = 2x2 stride 1 (no zero padding)
output first layer = 9x9
layer2 filter = 2x2 stride 1 (no zero padding)
output 2nd layer = 8x8
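To double-check the sizes above, here is a minimal sketch of the usual output-size arithmetic for a convolution along one dimension. The helper name conv_output_size is my own, not something from the paper, and it assumes a square input and a square filter.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial size along one dimension after a convolution.

    Hypothetical helper for illustration; assumes square inputs and filters.
    """
    return (input_size + 2 * padding - filter_size) // stride + 1


input_size = 10  # 10x10 image, single channel ignored as in the text

# Two plain convolution layers, each 2x2, stride 1, no zero padding.
after_layer1 = conv_output_size(input_size, filter_size=2)
after_layer2 = conv_output_size(after_layer1, filter_size=2)

print(after_layer1, after_layer2)  # 9 8
```

This matches the 9x9 and 8x8 outputs listed above.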
In the case of DenseNet, the first layer would be the same, i.e.
input = 10x10 black and white image, so 1 channel, which I will ignore for simplicity
layer1 filter = 2x2 stride 1 (no zero padding)
output first layer = 9x9
But for the 2nd layer, would we have a filter which is 4x2 with stride 1? They say concat, so I am concatenating the filters horizontally. Maybe they concatenate vertically instead?
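To make the two readings I have in mind concrete, here is a small sketch in plain Python. The filter values and the names filter_a, filter_b and shape are made up for illustration, and I am not claiming that either reading is what the paper actually does. Shapes are printed as (rows, columns), so which one counts as "4x2" depends on whether the size is read as rows x columns or width x height.

```python
def shape(matrix):
    """Return (rows, columns) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))


# Two made-up 2x2 filters; the values are arbitrary placeholders.
filter_a = [[1, 2],
            [3, 4]]
filter_b = [[5, 6],
            [7, 8]]

# Reading 1: concatenate horizontally (place the filters side by side).
side_by_side = [row_a + row_b for row_a, row_b in zip(filter_a, filter_b)]

# Reading 2: concatenate vertically (place one filter on top of the other).
stacked = filter_a + filter_b

print(shape(side_by_side))  # (2, 4)
print(shape(stacked))       # (4, 2)
```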