Hi @all @jeremy
Noticed something very interesting about how dropout is applied. I’ve actually never looked at the source for dropout, and can’t seem to dig out the exact source (some .py file) where the logic for it is executed. It seems like this is done in some compiled cuDNN code.
I understand that dropout involves removing some activations. This is quite evident to me, looking at how “dropout_mask” is implemented in fastai. What isn’t obvious is WHY the resulting non-zero activations are rescaled using the 1 / (1 - dropout) factor!
A small intuition is provided here: https://iamtrask.github.io/2015/07/28/dropout/
But the question still stands: is there some underlying mathematical reason why this rescaling is a must?
This should be it for the night!
I’ll try to give my best explanation. Others please feel free to correct me if I’m wrong.
Let’s take an example case with dropout = 0.75.

From what we’ve understood so far, this means that we randomly drop/remove 75% of the activations (a fraction of 3/4) in a given layer. Right?

Now, in order to preserve the overall representation of the input features (such that a bird is still considered a bird and not anything else), since the layer’s activations have been cut by 3/4, the remaining activations (the 1/4 part) should be scaled up by a factor of 4.

Think of it this way: 3/4 + 1/4 make a whole. Remove 3/4 and you need to rescale the rest by 4 to make it whole again.

This means a rescaling of 4 = 1/(1 - 0.75) = 1/(1 - dropout).
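For concreteness, here’s a minimal numpy sketch of that train-time rescaling (the “inverted dropout” idea; `inverted_dropout` is just an illustrative helper, not the actual fastai or PyTorch code):

```python
import numpy as np

def inverted_dropout(x, p, rng):
    """Zero out each activation with probability p, then rescale the
    survivors by 1/(1-p) so the expected activation is unchanged."""
    mask = rng.random(x.shape) >= p      # keep each unit with probability 1-p
    return x * mask / (1 - p)

rng = np.random.default_rng(0)
x = np.ones(1_000_000)
y = inverted_dropout(x, p=0.75, rng=rng)

print((y == 0).mean())   # ~0.75 of the activations are zeroed...
print(y.mean())          # ...but the mean activation stays ~1.0
```

The survivors each get multiplied by 1/(1 - 0.75) = 4, which is exactly what keeps the mean activation at its original value in expectation.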
IIRC this is done by default in PyTorch, as Jeremy mentioned. Hope this is clear.
Thanks for the response guys. I could see that the rescaling was such that the initial whole was restored. I finally hunted down how pytorch does it, and it’s exactly as implemented in the dropout_mask in fastai…
Intuitively trying to understand the rescaling, though, is still a little difficult for me. Say we have VGG-16 looking at an image of a cat. The consequence of dropout would mean that parts of the activation map would be dark! E.g. parts of the nose, eyes, or ears would be missing. Why would that mean the active pixels should be made stronger? Is it because we humans would do the same? I.e. focus harder on the pixels that remain lit.
I don’t know… (sigh). It both makes sense and doesn’t. Inherently, the neural network shouldn’t need the rescaled weights; it ought to learn the masked representation automatically! But then, maybe that just won’t work, and empirically it was found that applying the rescaling makes it work. Maybe the original paper on dropout would be a good place to start.
Thanks for the stimulating discussions
Applying dropout to inputs for CNNs is very rare, I think. It would then probably be referred to as adding noise to your inputs rather than dropout.
In general I think that dropout in the conv layers is quite uncommon, for the reasons you mention. The rule of thumb is that you might want to increase the amount of dropout proportionately to how far into the stack you are.
Thus the goal for dropout in CNNs might not be necessarily to blank out parts of the input, but to rather randomly drop some of the features we learn. Part of the reasoning goes along the lines: if you keep all the neurons, they will tend to learn more complex interactions that might be specific to your train set. If you start dropping them randomly, chances are that what they learn will be more robust and will be able to generalize better. There is also something quite neat I believe in the original paper that dropout can be seen as a way of ensembling exponentially many models.
Either way, I think you might be on the right track with looking at the original paper. Linking it here should anyone venture into this thread at some point and be interested. And yes, my memory served me well — it is one of those papers that reads like an article in a magazine. Quoting from the paper:
A motivation for dropout comes from a theory of the role of sex in evolution
@apil.tamang The reason for scaling appears to be answered quite well in the original dropout paper. Thanks @radek for linking to it, it’s quite fascinating:
At test time, it is not feasible to explicitly average the predictions from exponentially many thinned models. However, a very simple approximate averaging method works well in practice. The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2. This ensures that for any hidden unit the expected output (under the distribution used to drop units at training time) is the same as the actual output at test time. By doing this scaling, 2^n networks with shared weights can be combined into a single neural network to be used at test time.
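To see numerically why the paper’s test-time trick works, here’s a small numpy sketch with toy numbers (the activations and weights are made up for illustration). Note that the paper’s p is the *retain* probability, so p = 1 - dropout:

```python
import numpy as np

rng = np.random.default_rng(42)
p_keep = 0.25                   # retain probability (the paper's p); dropout = 0.75
a = np.array([2.0, -1.0, 0.5])  # toy hidden-unit activations
w = np.array([0.3, 0.8, -0.4])  # their outgoing weights

# Training scheme from the paper: keep each unit with probability p_keep,
# no rescaling. Average the thinned outputs over many random masks.
masks = rng.random((200_000, a.size)) < p_keep
trials = (a * masks) @ w

# Test-time scheme: no dropout, but outgoing weights multiplied by p_keep.
test_out = a @ (p_keep * w)

print(trials.mean())  # close to test_out: the expectations match
print(test_out)
```

Scaling weights down by p at test time (the paper’s scheme) and scaling activations up by 1/(1-dropout) at train time (the “inverted” scheme PyTorch uses) are two ways of keeping the same expected output, which is why the 1/(1-dropout) factor shows up in the code.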