Deconvolutions, Stacking, PReLUs, and Spatial Dropout

  • Dropout randomly zeros out activations to force the network to generalize better, overfit less, and essentially build in redundancies as a regularization technique. Spatial dropout in convolutional layers zeros out entire feature maps, to much the same purpose. This paper states that:

However, applying dropout technique to convolutional layers is not commonly recommended when
it comes to training deep and large network. It usually gives us poor performance than the model
trained without dropout. There are a few of works that argued that dropout over convolutional layers
also gives us additional performance improvement. However, its experiments were limited over
relatively small size datasets and networks [11]. What people usually have been doing when they try
to train large and deep CNNs is to apply dropout to the last two or three fully connected layers [7].
It turns out that this method achieved state-of-art results for most of recent CNN architectures.

The above is more or less in line with what we’ve been seeing in this course. What I’m looking for is some insight as to why spatial dropout (e.g. groupout) outperforms the baseline, and why regular dropout underperforms the baseline in ordinary CNNs.

  • Next question – do any of you have advice on mixing ReLUs with leaky ReLUs/PReLUs within a single network? Which parts of the architecture (e.g. the encoder) are best suited to PReLUs, and which parts (e.g. the decoder) are better served by plain ReLUs?

  • Since convolutions are essentially linear operations that can be stacked, why do some networks stack multiple convolutions on top of one another before applying a non-linearity? Is that a n00b mistake? Are the network architects simply using the exposed APIs to accomplish their real objective of building a single combined filter? Or is there something else at play here that I’m missing?

  • I’ve been looking at a lot of different U-Net implementations and I’ve noticed that some people use simple upsampling, while others opt for deconvolution. I’m pretty sure I completely understand convolutions, but I’m struggling to understand deconvolutions. Does anyone have a good analogy or a solid explanation of what deconvolutional layers do and how they work?


Can you give a specific example of a network with stacked convolutional layers and linear activations?

Hi eduedix,

I didn’t mean linear activations, I meant linear operations, i.e. the convolutions themselves are essentially multiplications between weights and values. So convolutions stacked on convolutions (without a non-linear activation between them) are like a system of equations that can be collapsed into a single equation. That said, I believe I’ve found the answer to that particular question! See the paper Rethinking the Inception Architecture for Computer Vision, on the page numbered 2820, section “3. Factorizing Convolutions with Large Filter Size”.

In summary, by stacking smaller convolutions on top of one another, one can mimic larger convolutions more efficiently (fewer parameters and fewer multiplications, hence faster training and a smaller footprint).
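
To make the saving concrete, here’s a minimal sketch (assuming PyTorch and a hypothetical 64-channel layer, neither of which is specified in this thread) comparing the parameter counts:

```python
import torch.nn as nn

channels = 64  # hypothetical channel count, just for illustration

# A single 5x5 convolution...
big = nn.Conv2d(channels, channels, kernel_size=5, padding=2, bias=False)

# ...vs. two stacked 3x3 convolutions covering the same 5x5 receptive field
# (the Inception paper puts a nonlinearity between them).
small = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(big))    # 64 * 64 * 5 * 5     = 102400
print(count(small))  # 2 * 64 * 64 * 3 * 3 = 73728, roughly 28% fewer
```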

I was also able to figure out the answer to the first question, either from another paper or from a comment in someone’s source code. Essentially, element-wise dropout in a fully convolutional network is like zeroing out random pixels in an image: it’s still pretty easy to tell what the image is, because there’s so much spatial correlation within the feature maps, especially in 2D and 3D FCNs, even if you increase dropout to 50%. Spatial dropout, on the other hand, zeros out entire feature maps, which leads to the same regularization effect we see in fully connected / feed-forward networks.
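
For anyone who wants to see the difference in code, here’s a small sketch (assuming PyTorch; the tensor sizes are made up) contrasting element-wise dropout with spatial dropout (nn.Dropout2d):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fmap = torch.ones(1, 4, 6, 6)  # batch of 1, four 6x6 feature maps (made-up sizes)

# Element-wise dropout zeros individual activations ("random pixels"), so the
# surviving neighbours in the same map still carry most of the information.
# Freshly constructed modules are in training mode, so dropout is active here.
elementwise = nn.Dropout(p=0.5)(fmap)

# Spatial dropout zeros entire feature maps/channels, which is much closer to
# dropping a whole unit in a fully connected layer.
spatial = nn.Dropout2d(p=0.5)(fmap)

print((elementwise == 0).float().mean())          # roughly 0.5, scattered zeros
print((spatial == 0).view(1, 4, -1).all(dim=-1))  # whole channels are either kept or zeroed
```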

I don’t recall which one it was, but one of the papers / projects on this site had some notes on deconv / inverse conv, though I still haven’t wrapped my head around it. And I’m still interested in the ReLU/PReLU mix.


One point I want to add on why this is true: in CNNs you often want to do your regularization (layer norm, batch norm, or dropout) per channel / feature map. This is because you want to preserve the convolutional property, i.e. different locations in a feature map are normalized in the same way. So think of feature maps/channels as equivalent to a single unit in a fully connected NN.
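
A quick sketch of what “per channel” means in practice (assuming PyTorch’s BatchNorm2d; the shapes are made up):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)   # (N, C, H, W), made-up shapes

bn = nn.BatchNorm2d(16)          # one mean/variance (and gamma/beta) per channel
y = bn(x)                        # training mode: uses batch statistics

# Each channel is normalized with statistics pooled over N, H and W, so every
# spatial location of a feature map is treated the same way - the channel plays
# the role of a single unit in a fully connected network.
print(y.mean(dim=(0, 2, 3)))                 # approximately 0 for each of the 16 channels
print(y.var(dim=(0, 2, 3), unbiased=False))  # approximately 1 for each channel
```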

You can combine convolutional layers to get the same receptive field as a larger kernel with fewer computations.

E.g. instead of a 5x5, do two 3x3 convolutions (9 + 9 < 25 weights per channel). The second layer then has an effective 5x5 receptive field (so even with a linear activation it’s not possible to replace two 3x3 layers with a single 3x3 layer, although you could with a 5x5).
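
To check the “linear activation” part of that claim numerically, here’s a sketch (assuming PyTorch; the random kernels are just for illustration) that collapses two stacked 3x3 convolutions into the single equivalent 5x5 kernel:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x  = torch.randn(1, 1, 8, 8)   # single-channel toy "image"
k1 = torch.randn(1, 1, 3, 3)   # first 3x3 kernel
k2 = torch.randn(1, 1, 3, 3)   # second 3x3 kernel

# Two stacked 3x3 convolutions with no nonlinearity in between.
stacked = F.conv2d(F.conv2d(x, k1), k2)            # shape (1, 1, 4, 4)

# The equivalent single kernel is the (true) convolution of k1 with k2, which
# has 5x5 support. Since F.conv2d actually does cross-correlation, we flip one
# kernel to turn the operation into a true convolution.
k_eff = F.conv2d(F.pad(k1, (2, 2, 2, 2)),          # pad k1 to 7x7
                 torch.flip(k2, dims=[2, 3]))      # -> a 5x5 kernel
single = F.conv2d(x, k_eff)                        # shape (1, 1, 4, 4)

print(torch.allclose(stacked, single, atol=1e-5))  # True: two 3x3s == one 5x5
```

A single 3x3 layer can’t reproduce this, because the effective kernel genuinely spans 5x5.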

As far as upsampling goes - simple upsampling (resize followed by an ordinary convolution) avoids the checkerboard artifacts you often get from straight deconvolution.

See: http://distill.pub/2016/deconv-checkerboard/
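
For reference, the two upsampling options look roughly like this in code (a sketch assuming PyTorch and made-up channel counts, not any particular U-Net implementation):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)   # made-up decoder feature map

# Transposed convolution ("deconvolution"): learnable 2x upsampling. When the
# kernel size isn't divisible by the stride, the uneven kernel overlap is what
# produces the checkerboard artifacts discussed in the distill.pub article.
deconv = nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2,
                            padding=1, output_padding=1)

# Resize-then-convolve: fixed nearest-neighbour (or bilinear) upsampling
# followed by an ordinary convolution, which avoids that failure mode.
upsample_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(64, 32, kernel_size=3, padding=1),
)

print(deconv(x).shape)          # torch.Size([1, 32, 32, 32])
print(upsample_conv(x).shape)   # torch.Size([1, 32, 32, 32])
```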