TL;DR: For anyone running into the same question, here is what is going on with dropout rescaling. Jeremy does say in Lesson 12 that dividing by 1-p keeps the standard deviation the same before and after dropout, but I think he misspoke, because this is not true. Inverted dropout keeps the mean of the activations the same as before dropout, but not the standard deviation. I ran a lot of tests, and it turns out that having the same mean is much more important than having the same standard deviation, so that’s why everyone does it like that.
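The mean-versus-standard-deviation claim can be checked with a short calculation. This is only a sketch under stated assumptions: the activations x have mean \mu and variance \sigma^2 (notation introduced here, not from the lesson), and the mask m is drawn independently of x.

```latex
\begin{align*}
m &= \frac{b}{1-p}, \qquad b \sim \mathrm{Bernoulli}(1-p) \\
\mathbb{E}[m] &= 1, \qquad \mathbb{E}[m^2] = \frac{1-p}{(1-p)^2} = \frac{1}{1-p} \\
\mathbb{E}[x m] &= \mathbb{E}[x]\,\mathbb{E}[m] = \mu \\
\operatorname{Var}(x m) &= \mathbb{E}[x^2]\,\mathbb{E}[m^2] - \mu^2
                         = \frac{\sigma^2 + \mu^2}{1-p} - \mu^2
\end{align*}
```

For standard normal inputs (\mu = 0, \sigma = 1) this gives a standard deviation of 1/\sqrt{1-p}, roughly 1.054, 1.155 and 1.414 for p = 0.1, 0.25 and 0.5, so the mean is preserved while the standard deviation grows.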
In the 12a_awd_lstm.ipynb notebook, dropout_mask is defined by running a Bernoulli trial and dividing the results by 1-p, effectively upscaling the non-dropped activations.
def dropout_mask(x, sz, p): return x.new(*sz).bernoulli_(1-p).div_(1-p)
Jeremy explains in the video for Lesson 12 that this is to keep the standard deviation constant, but in my tests that is not the case. I found that dividing by 1-p increased the standard deviation from 1.0 by about 5% for p = 0.1, by about 15% for p = 0.25, and by about 41% for p = 0.5.
This is how I ran the tests:
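import torch

n = 10000
x = torch.randn(n, n)                  # standard normal inputs: mean 0, std 1
mask = dropout_mask(x, (n, n), 0.5)    # entries are 0 or 1/(1-p)
d = x * mask                           # activations after inverted dropout
x.mean(), x.std(), d.mean(), d.std()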
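To repeat the measurement for all three dropout rates in one go, something like the following could be used. It is a minimal sketch: std_ratio is a hypothetical helper name, and the smaller n and the fixed seed are assumptions chosen to keep the run quick and repeatable, not part of the original test.

```python
import torch

def dropout_mask(x, sz, p):
    # Same definition as in the notebook: Bernoulli keep-mask scaled by 1/(1-p)
    return x.new(*sz).bernoulli_(1-p).div_(1-p)

def std_ratio(p, n=1000, seed=0):
    # Hypothetical helper: returns the mean after dropout and std(after) / std(before)
    torch.manual_seed(seed)
    x = torch.randn(n, n)
    d = x * dropout_mask(x, (n, n), p)
    return d.mean().item(), (d.std() / x.std()).item()

for p in (0.1, 0.25, 0.5):
    mean, ratio = std_ratio(p)
    print(f"p={p}: mean after dropout={mean:+.4f}, std ratio={ratio:.3f}")
```

Reporting the ratio rather than the raw standard deviation makes the percentage increase easy to read off directly.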
I have no idea how much of an effect this could have or why it happens, but I think it is strange and maybe worth investigating. If anyone knows the explanation for this, I would appreciate it a lot.