Hi eduedix,
I didn’t mean linear activations, I meant linear operations, i.e. the convolutions themselves are essentially multiplications between weights and values. So convolutions stacked on convolutions (without a non-linear activation between them) are like a system of linear equations that can be collapsed into a single equation. That said, I believe I’ve found the answer to that particular question! See the paper titled Rethinking the Inception Architecture for Computer Vision, page 2820, section "3. Factorizing Convolutions with Large Filter Size".
In summary, by stacking smaller convolutions on top of one another, one can mimic larger convolutions more efficiently (fewer parameters and fewer multiplications, thus faster training and a smaller footprint).
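As a back-of-the-envelope check of that savings claim (the channel count C here is just an arbitrary value I picked for illustration):

```python
# Rough weight count for a KxK convolution mapping c_in channels to
# c_out channels, ignoring biases.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

C = 64  # arbitrary channel count, for illustration only

one_5x5 = conv_params(5, C, C)      # 25 * C * C weights
two_3x3 = 2 * conv_params(3, C, C)  # 18 * C * C weights, same 5x5 receptive field

# Two stacked 3x3 convs cover the same receptive field as one 5x5 conv
# with 18/25 of the weights, i.e. a 28% reduction.
print(one_5x5, two_3x3)  # → 102400 73728
```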
I was also able to figure out the answer to the first question by reading some other paper, or maybe it was a comment in someone’s source code. Essentially, ordinary dropout in a fully convolutional network is like zeroing out random pixels in an image: it’s still pretty easy to tell what the image is, since neighboring activations are so strongly correlated, especially in 2D and 3D FCNs, even if you raise the dropout rate to 50%. Spatial dropout, on the other hand, zeros out entire convolutional feature maps, which recovers the same regularization effect we see in fully connected / feed-forward networks.
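To make the contrast concrete, here’s a toy, framework-free sketch of the two behaviors (pure Python; a real implementation, e.g. PyTorch’s nn.Dropout2d, also rescales survivors by 1/(1-p), which this skips):

```python
import random

def element_dropout(fmap, p, rng):
    # Ordinary dropout: zeroes individual "pixels" of one feature map.
    # Surviving neighbors are correlated with the dropped ones, so the
    # map is still easy to reconstruct downstream.
    return [[0.0 if rng.random() < p else v for v in row] for row in fmap]

def spatial_dropout(fmaps, p, rng):
    # Spatial dropout: drops entire feature maps (channels) at once,
    # so no correlated neighbors survive within a dropped channel.
    return [
        [[0.0] * len(fmap[0]) for _ in fmap] if rng.random() < p else fmap
        for fmap in fmaps
    ]

rng = random.Random(0)
fmaps = [[[1.0] * 4 for _ in range(4)] for _ in range(8)]  # 8 channels, 4x4 each

dropped = spatial_dropout(fmaps, 0.5, rng)
# Every channel in `dropped` is either untouched (all 1.0) or entirely
# zeroed; there are no partially-dropped maps.
```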
I don’t recall which one it was, but one of the papers / projects on this site had some notes on deconv / inverse conv; I still haven’t wrapped my head around it. And I’m still interested in the relu/prelu mix.