I want more intuition on why DenseNet/Tiramisu works so well. I read the paper and understand that it is more efficient due to parameter reuse, but I expected them to say more about the multi-context nature of the filters. Similar to pyramid/à trous convolutions, can anyone elaborate on this aspect of DenseNets?
Here is my intuitive take on this (translation: my own take, with no basis in theory, which could be entirely wrong but is still worth pondering).
One analogy I have is that a neural network is like a differential equation. In VGG, for example, as you go from layer one to layer n, you can see the filters getting more complex: the first layer is just colors, lines, etc., and the last layer is eyes, ears, or some other recognizable part from ImageNet, which is some combination of the previous layers’ filters.
In the past, most of the concern was about propagating gradients from the last layer back to the first, which become increasingly diminished as the depth of the network increases. ResNet addresses this by “short-circuiting” the path. This works really well for learning. However, it does not do much for recognition.
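To make the “short-circuiting” concrete, here is a minimal PyTorch sketch of a residual connection (my own simplification, not the exact block from the ResNet paper; the layer choices are just illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified ResNet-style block: the input is added back to the
    transformed output, 'short-circuiting' the gradient path."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        # Gradients can flow straight through the identity term `x`,
        # which is what makes very deep networks trainable.
        return x + torch.relu(self.bn(self.conv(x)))
```

Note that the skip is an addition, so the low-level activations get mixed into (and eventually dominated by) the transformed features rather than being kept around as-is.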
The key insight from the DenseNet paper is that you can also use the “lower order” coefficients to recognize images. To me this is equivalent to saying “VGG, ResNet, etc. use only the nth-order differential equation for recognizing objects, whereas DenseNet also utilizes lower-layer activations.”
Put differently, VGG does not take into account that grass is green (something you can infer from first-layer activations) when detecting a playground, but DenseNet can. This is why, intuitively, I believe DenseNets perform better despite having fewer parameters.
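Here is a sketch of why the low-level activations stay visible in a dense block: each layer receives the concatenation of all earlier feature maps, so first-layer features like color are still directly available at the end. Again, this is a simplified illustration (real DenseNet uses BN-ReLU-Conv ordering, bottlenecks, and transition layers); the sizes below are made up:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Simplified DenseNet-style block: each layer sees ALL earlier
    feature maps, so later layers can still 'see' low-level activations."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                      kernel_size=3, padding=1)
            for i in range(num_layers))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate everything produced so far along the channel
            # dimension; this is the reuse of "lower order" activations.
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)
        return torch.cat(features, dim=1)

# e.g. a tiny block: 16 input channels, growth rate 12, 3 layers
block = DenseBlock(in_channels=16, growth_rate=12, num_layers=3)
out = block(torch.randn(1, 16, 32, 32))  # -> 16 + 3*12 = 52 channels
```

The contrast with the residual block above is concatenation versus addition: the original input `x` is still an untouched slice of the output, which is also why each layer can be narrow (small growth rate) and the whole network needs fewer parameters.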
I’d love to see visualizations like the ones Matthew Zeiler did with VGG. Since DenseNet got the CVPR best paper award, I expect more interesting papers on this to follow.