TL;DR: For anyone running into the same question, here is what is going on with dropout rescaling. While Jeremy does in fact say in Lesson 12 that dividing by `1-p` ensures that the standard deviation is the same after dropout as before, I think he misspoke, because this is not true. Inverted dropout ensures that the activations have the same mean as before dropout, but not the same standard deviation. I ran a lot of tests, and it turns out that having the same mean is much more important than having the same standard deviation, so that’s why everyone does it like that.
In the 12a_awd_lstm.ipynb notebook, `dropout_mask` is defined by using a Bernoulli trial and dividing the result by `1-p`, effectively upscaling the non-dropped activations.
def dropout_mask(x, sz, p):
Jeremy explains in the video for Lesson 12 that this is to keep the standard deviation constant, but in my tests that is not the case. I found that dividing by `1-p` increased the standard deviation from 1.0 by about 5% for `p = 0.1`, by about 15% for `p = 0.25`, and by about 41% for `p = 0.5`.
This is how I ran the tests:
n = 10000
x = torch.randn(n, n)
mask = dropout_mask(x, (n, n), 0.5)
d = (x*mask)
x.mean(), x.std(), d.mean(), d.std()
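The test above can be repeated for all three values of `p` in one go. This is only a sketch: the body of `dropout_mask` is an assumption reconstructed from the description (a Bernoulli keep mask divided by `1-p`), and `n` is reduced to 1000 so it runs quickly. The last column, `1/sqrt(1-p)`, is the standard deviation one would expect for zero-mean, unit-variance input.

```python
import math
import torch

torch.manual_seed(0)

def dropout_mask(x, sz, p):
    # Assumed body: keep each element with probability 1-p,
    # then divide by 1-p (inverted dropout).
    return x.new_empty(sz).bernoulli_(1 - p).div_(1 - p)

n = 1000  # smaller than the 10000 used above, to keep the run short
x = torch.randn(n, n)

results = {}
for p in (0.1, 0.25, 0.5):
    d = x * dropout_mask(x, (n, n), p)
    results[p] = (d.mean().item(), d.std().item(), 1 / math.sqrt(1 - p))
    print(f"p={p}: mean={results[p][0]:+.4f}  std={results[p][1]:.4f}  "
          f"1/sqrt(1-p)={results[p][2]:.4f}")
```

If the assumption about `dropout_mask` holds, the measured standard deviations should land close to 1.054, 1.155 and 1.414, which matches the increases of roughly 5%, 15% and 41% reported above.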
I have no idea how much of an effect this could have or why it happens, but I think it is strange and maybe worth investigating. If anyone knows the explanation for this, I would appreciate it a lot.
I am starting to understand where my confusion comes from. Jeremy indeed says in Lesson 12 at 01:57:18 that (inverted) dropout rescaling is done to achieve a standard deviation of one. However, in the original dropout paper (and in other sources online) the purpose of the rescaling is described as making “sure that for each unit, the expected output from it under random dropout will be the same as the output during pretraining”. Now I’m not saying that doesn’t effectively mean the same standard deviation, but it also doesn’t say so explicitly. So I simply don’t know what exactly they mean.
Here is an example of another source that says rescaling is done so that the next layer doesn’t get a “lesser value”. That sounds like the same mean to me, but again, it isn’t mathematically clear. I have yet to see a proper mathematical explanation of what rescaling with 1/p does exactly. Maybe it is so simple and obvious to everyone else that nobody bothers.
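For what it is worth, the effect of the rescaling on the mean and the variance can be worked out directly. The following is a sketch under two assumptions that are not stated in the thread: `p` is the probability of dropping (the PyTorch convention, so the scale is written 1/(1-p) rather than 1/p), and the mask `m` is independent of the activation `x`.

```latex
\begin{aligned}
m &\sim \mathrm{Bernoulli}(1-p), \qquad y = \frac{m}{1-p}\,x \\
\mathbb{E}[y] &= \frac{\mathbb{E}[m]}{1-p}\,\mathbb{E}[x] = \mathbb{E}[x] \\
\mathbb{E}[y^2] &= \frac{\mathbb{E}[m^2]}{(1-p)^2}\,\mathbb{E}[x^2] = \frac{\mathbb{E}[x^2]}{1-p} \\
\operatorname{Var}(y) &= \frac{\mathbb{E}[x^2]}{1-p} - \mathbb{E}[x]^2
\end{aligned}
```

So the mean is preserved exactly, while the variance is not. For zero-mean, unit-variance input this gives a standard deviation of 1/sqrt(1-p), i.e. about 1.054, 1.155 and 1.414 for p = 0.1, 0.25 and 0.5.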
Let’s say we have a layer where 100 activations go in and 1 activation goes out. So our input is a vector of size `v_in = (100,)`, our weight matrix is of size `W = (100, 1)` and our output is of size `v_out = (1,)`.
Let’s say, for the sake of argument, that each activation and weight is equal to 1, so `v_in@W = 100`.
We apply a dropout mask with `p = 0.3`, so during training we set 30% of our weights to 0. Now we have a problem. With dropout, `v_in@(W*mask) = 70` on average. Without dropout, `v_in@W = 100`. The dropout mask affects the scale of the activations. We want the model to experience the same magnitude of activations during training and inference.
The way PyTorch deals with this is by scaling the activations by `1/(1-p)` during training. So during training, we have `v_in@(W*mask) * 1/(1-0.3) = 70 / 0.7 = 100`, and the scale is the same as in the computation without dropout. Equivalently, we could choose not to scale during training and instead scale by `(1-p)` during inference (which is what the dropout paper does).
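The arithmetic of this example can be checked with a few lines of plain Python. This is a minimal sketch, not PyTorch’s implementation: the helper names `forward` and `sample_mask` are made up for illustration, and the mask is drawn at random, so the values 70 and 100 show up as averages over many draws rather than on every single pass.

```python
import random

random.seed(0)

p = 0.3                # probability of dropping (PyTorch convention)
v_in = [1.0] * 100     # 100 input activations, all equal to 1
W = [1.0] * 100        # weights of the single output unit, all equal to 1

def forward(v, w, mask=None, scale=1.0):
    """Compute scale * (v @ (w * mask)) for a single output unit."""
    if mask is None:
        mask = [1.0] * len(w)
    return scale * sum(vi * wi * mi for vi, wi, mi in zip(v, w, mask))

def sample_mask(n, p):
    """Keep each entry with probability 1 - p, drop it otherwise."""
    return [1.0 if random.random() >= p else 0.0 for _ in range(n)]

trials = 2000
unscaled = [forward(v_in, W, sample_mask(100, p)) for _ in range(trials)]
inverted = [u / (1 - p) for u in unscaled]   # rescale during training

mean_unscaled = sum(unscaled) / trials       # close to 70
mean_inverted = sum(inverted) / trials       # close to 100
no_dropout = forward(v_in, W)                # exactly 100
paper_style_inference = forward(v_in, W, scale=1 - p)  # 70, no mask at inference

print(mean_unscaled, mean_inverted, no_dropout, paper_style_inference)
```

Either choice lines up training and inference: rescaled training averages 100 against an unscaled inference of 100, while unscaled training averages 70 against a scaled inference of 70. Individual training passes still scatter around those averages, because the number of surviving weights changes from draw to draw.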
Additionally, we need to be careful with notation. In PyTorch, `p` denotes the probability of dropping an activation. So if we have `p=0.3`, we drop out 30% of our activations. Another notation uses `p` to denote the probability of keeping an activation. This is the notation used by the dropout paper. With that notation, `p=0.3` means you keep 30% of your activations and drop out 70%. So when PyTorch rescales by `1/(1-p)` and the dropout paper rescales by `1/p`, they’re doing the same thing, just with a different definition of `p`.
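To make the two conventions concrete, here is a tiny sketch. The function names are hypothetical and exist only for this illustration; neither is part of PyTorch or of the paper.

```python
def scale_from_drop_prob(p_drop):
    """Training-time scale when p is the probability of dropping (PyTorch convention)."""
    return 1.0 / (1.0 - p_drop)

def scale_from_keep_prob(p_keep):
    """Training-time scale when p is the probability of keeping (dropout-paper convention)."""
    return 1.0 / p_keep

p_drop = 0.3
p_keep = 1.0 - p_drop  # the same layer, described in the other convention

print(scale_from_drop_prob(p_drop), scale_from_keep_prob(p_keep))
```

Both calls return the same factor, about 1.4286, because dropping 30% and keeping 70% describe the same mask.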
Thanks for the explanation, Karl. This is my understanding, too, but that means it really has nothing to do with either means or standard deviations, but simply with the fact that the exact same input should result in the same activations in training and inference, right?
I guess where I am also not clear is how Dropout has any effect in your example. If the one activation is 100, with or without dropout, how does dropout have any effect? My guess would be in the backward pass, but not sure.
Of course it’s related to the means and standard deviations. Multiplying by a constant scales your mean and standard deviation. Scaling is just done to ensure that the magnitude of activations (and their mean/standard deviation) that the model sees during training are the same as during testing. If you didn’t scale, your model would behave wildly differently during training vs testing.
You should read more on the mechanism of dropout. The purpose of dropout isn’t to alter the scale of the activations of your model. It’s to force the model to learn to use different combinations of activations/weights to reach the same outcomes. This is explained in the abstract of the dropout paper.
FYI, inverted dropout scaling by `1/(1-p)` achieves about the same mean, but not the same standard deviation (if you think about the formula for standard deviation, it makes a lot of sense that scaling the post-dropout tensor results in a higher standard deviation than before dropout). However, it turns out that achieving the same mean is way more important than achieving the same standard deviation (which would be accomplished by scaling by 1/sqrt(1-p)).
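The trade-off between the two scale factors can be seen in a small experiment. This is a sketch in plain Python with made-up helper names, using `p = 0.5`. One caveat that is my own addition: scaling by 1/sqrt(1-p) keeps the standard deviation only when the input has zero mean, so the standard deviation is measured on zero-mean data and the mean on a copy shifted to mean 1.

```python
import math
import random

random.seed(0)

def mean(v):
    return sum(v) / len(v)

def std(v):
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / len(v))

def dropout(v, p, scale):
    """Zero each entry with probability p and multiply the survivors by `scale`."""
    return [a * scale if random.random() >= p else 0.0 for a in v]

p, n = 0.5, 100_000
zero_mean = [random.gauss(0.0, 1.0) for _ in range(n)]  # mean 0, std 1
shifted = [a + 1.0 for a in zero_mean]                  # mean 1, std 1

inv = 1 / (1 - p)            # inverted dropout
alt = 1 / math.sqrt(1 - p)   # the std-preserving alternative

std_inv = std(dropout(zero_mean, p, inv))   # about 1.41
std_alt = std(dropout(zero_mean, p, alt))   # about 1.00
mean_inv = mean(dropout(shifted, p, inv))   # about 1.00
mean_alt = mean(dropout(shifted, p, alt))   # about 0.71

print(std_inv, std_alt, mean_inv, mean_alt)
```

So each factor preserves one statistic and distorts the other: `1/(1-p)` keeps the mean and inflates the standard deviation, while 1/sqrt(1-p) keeps the standard deviation of zero-mean input and shrinks the mean.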