Code deep-dive ... understanding the application of "dropout"

Hi @all @jeremy

Noticed something very interesting about how dropout is applied. I’ve actually never looked at the source for dropout, and can’t seem to dig out the exact source (some .py file) where the logic is executed. It seems like this is done in some compiled cuDNN code.

I understand that dropout involves removing some activations. This is quite evident to me, looking at how “dropout_mask” is implemented in fastai. What isn’t obvious is WHY the resulting non-zero activations are rescaled by a factor of 1 / (1 - dropout)!

A small intuition is provided here: https://iamtrask.github.io/2015/07/28/dropout/

But the question still stands: is there some underlying mathematical reason why this rescaling is a must?

Thnx all…
This should be it for the night!

1 Like

I’ll try to give my best explanation. Others please feel free to correct me if I’m wrong. :slight_smile:

Let’s take an example case with dropout = 0.75.

From what we’ve understood so far, this means that we randomly drop/remove 75% of activations (or a fraction of 3/4) in a given layer. Right?

Now, since the layer’s activations have been cut by 3/4, the remaining activations (the surviving 1/4) should be scaled up by a factor of 4 in order to preserve the overall representation of the input features, so that a bird is still considered a bird and not anything else.

Think of this as: 3/4 + 1/4 make a whole. Remove 3/4 and you need to rescale the rest by 4 to make it whole again. More precisely, each activation survives with probability 1/4, so dividing the survivors by 1/4 keeps the expected value of the layer’s output the same as it would be without dropout.

This means a rescaling factor of 4 = 1/(1 - 0.75) = 1/(1 - dropout)

This is done by default in PyTorch IIRC as mentioned by Jeremy. Hope this is clear. :slight_smile:
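A quick way to see this behaviour for yourself (a minimal check using torch.nn.functional.dropout, not the fastai source):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(8)                          # eight activations, all equal to 1
out = F.dropout(x, p=0.75, training=True)  # p is the drop probability

# Surviving entries come out as 4.0 == 1 / (1 - 0.75); dropped entries are 0.
print(out)

# In eval mode (training=False) the input is returned unchanged.
print(F.dropout(x, p=0.75, training=False))
```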

7 Likes

I am not sure about the PyTorch implementation, but there are two equivalent ways of going about this :slight_smile: You can either scale the surviving activations up by that factor during training, or you can let the network learn those (effectively larger) weights and then scale them down at test time :slight_smile:

The latter approach is, I think, how dropout is explained in most of the literature, while the former is how it is actually implemented, at least in Keras :slight_smile: What you get by doing the scaling during training is that at test time you don’t have to touch the weights at all (faster inference!). Also - and that is key imho - you can just take the weights, do whatever you’d like with them, change the dropout rate, remove the dropout layer… and things will just work, since they are already appropriately scaled :slight_smile:
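Here is a small sketch of the two conventions side by side (plain tensor ops, not anyone’s library code; the 0.5 drop rate is an arbitrary choice):

```python
import torch

torch.manual_seed(0)
p = 0.5                                   # drop probability (arbitrary choice here)
x = torch.rand(100_000)                   # fake "activations" in [0, 1]
mask = (torch.rand_like(x) > p).float()   # 1 = keep, 0 = drop

# Approach 1 ("inverted" dropout): rescale the survivors at training time,
# leave everything untouched at test time.
train_inverted = x * mask / (1 - p)

# Approach 2 ("vanilla" dropout): just drop at training time,
# then scale everything down by (1 - p) at test time.
train_vanilla = x * mask
test_vanilla = x * (1 - p)

# Either way, the expected training-time output matches the test-time output:
print(train_inverted.mean().item(), x.mean().item())            # ~ equal
print(train_vanilla.mean().item(), test_vanilla.mean().item())  # ~ equal
```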

I know this tripped me up to no end when I first encountered it :slight_smile: Something worth keeping at the back of your mind while looking at the code. BTW I wonder how it is done in PyTorch - can’t look at the code ATM but maybe later, unless someone beats me to it :slight_smile:

3 Likes

@A_TF57
@radek

Thanks for the responses, guys. I could see that the rescaling was such that the initial whole was restored. I finally hunted down how PyTorch does it, and it’s exactly as implemented in dropout_mask in fastai…
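For reference, a mask in that style looks roughly like this (written from memory as a sketch, not a verbatim copy of the fastai source):

```python
import torch

def dropout_mask(x, sz, dropout):
    # Bernoulli mask of shape sz: keep each entry with probability (1 - dropout),
    # then rescale the kept entries by 1 / (1 - dropout).
    return x.new_empty(*sz).bernoulli_(1 - dropout) / (1 - dropout)

x = torch.randn(2, 5)
mask = dropout_mask(x, x.size(), 0.75)
print(mask)       # entries are either 0.0 or 4.0
print(x * mask)   # masked and rescaled activations
```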

Intuitively, though, the rescaling is still a little difficult for me to grasp. Say we have VGG-16 looking at an image of a cat. The consequence of dropout is that parts of the activation map go dark! e.g. parts of the nose, eyes, or ears would be missing. Why would that essentially mean that the surviving activations should be made stronger? Is it because we humans would do the same? i.e. focus harder on the pixels that remain lit.

I don’t know… (sigh). It both makes sense and doesn’t. Inherently, the neural network shouldn’t need the rescaled weights; it ought to learn the masked representation automatically! But then maybe that just wouldn’t work, and empirically it was found that applying the rescaling makes it work. Maybe the original paper on dropout would be a good place to start.

Thanks for the stimulating discussions :slight_smile: :face_with_monocle:

1 Like

Applying dropout to the inputs of a CNN is very rare, I think. It would then probably be referred to as adding noise to your inputs rather than dropout.

In general I think that dropout in the conv layers is quite uncommon, for the reasons you mention. The rule of thumb is that you might want to increase the amount of dropout in proportion to how far into the stack you are.

Thus the goal of dropout in CNNs might not necessarily be to blank out parts of the input, but rather to randomly drop some of the features we learn. Part of the reasoning goes along these lines: if you keep all the neurons, they will tend to learn complex interactions that might be specific to your training set. If you start dropping them randomly, chances are that what they learn will be more robust and will generalize better. There is also something quite neat in the original paper, I believe: dropout can be seen as a way of ensembling exponentially many models.

Either way, I think you are on the right track with looking at the original paper. Linking it here should anyone venture into this thread at some point and be interested: https://jmlr.org/papers/v15/srivastava14a.html :slight_smile: And yes, my memory served me well - it is one of those papers that reads like an article in a magazine :slight_smile: Quoting from the paper:

A motivation for dropout comes from a theory of the role of sex in evolution

5 Likes

@apil.tamang The reason for the scaling appears to be answered quite well in the original dropout paper. Thanks @radek for linking to it, it’s quite fascinating :slight_smile:

At test time, it is not feasible to explicitly average the predictions from exponentially many thinned models. However, a very simple approximate averaging method works well in practice. The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2. This ensures that for any hidden unit the expected output (under the distribution used to drop units at training time) is the same as the actual output at test time. By doing this scaling, 2^n networks with shared weights can be combined into a single neural network to be used at test time.
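Note that the paper’s p is the retention (keep) probability, i.e. 1 - dropout in the notation used earlier in the thread. A tiny numerical check of the expectation claim (the activation and weight values are made up purely for illustration):

```python
import torch

torch.manual_seed(0)
p = 0.25                 # keep probability, i.e. 1 - dropout for dropout = 0.75
a, w = 2.0, 0.5          # a unit's activation and one outgoing weight (illustrative)

# Training: the unit is kept with probability p, so its expected
# contribution through this weight is p * w * a.
keeps = (torch.rand(1_000_000) < p).float()
expected_train = (keeps * w * a).mean()

# Test time (paper's convention): no dropout, but the outgoing weight is scaled to p * w.
test_out = (p * w) * a

print(expected_train.item(), test_out)   # both ~ 0.25
```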
4 Likes