Why do we halve all the FC layer weights to remove Dropout instead of just the layers after Dropouts?

If you have dropout in layer n, what it does is remove, or zero out, some portion of the nodes in the preceding layer (n - 1). A rate of 0.5 is a nice value to consider, since it is easy to imagine half of the nodes in the previous layer getting removed from a given calculation.

Removing half the nodes doesn't change the target. Meaning, if layer n + 1, directly following the dropout layer, has just a single node, and the target value of that node for a given example is 1, then the network will be learning to produce 1 on that example with just half the nodes in the previous layer available. If, on average, only half the nodes in n - 1 are available, then the weights going from n - 1 to n + 1 will have to be twice as big in order to output the 1 we are after.
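To make that concrete, here is a tiny numpy sketch (the sizes and values are made up for illustration): with roughly half the inputs zeroed at random, doubled weights still produce the target of 1 in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.ones(1000)           # activations of layer n - 1
w = np.full(1000, 0.001)    # weights into the single node in layer n + 1; x @ w == 1.0

mask = rng.random(1000) >= 0.5      # dropout with rate 0.5: ~half the nodes survive
out_no_dropout = x @ w              # 1.0, using all nodes and the original weights
out_dropout = (x * mask) @ (2 * w)  # ~1.0 on average, but only with doubled weights

print(out_no_dropout, out_dropout)
```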

If we then remove the dropout, for any given example we will have twice as many nodes available in layer n - 1. That means that, on average, each weight connecting to n + 1 will only need half its original magnitude to achieve a result equivalent to the one under dropout.
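In code, the fix from the question amounts to something like this (a sketch assuming the original, non-inverted dropout scheme with rate 0.5; the weight matrix is just a stand-in for the kernel of the FC layer that follows the Dropout):

```python
import numpy as np

drop_rate = 0.5
W = np.random.randn(4096, 4096) * 0.01   # stand-in for the FC layer's weight matrix

# When the Dropout layer is removed, scale these weights by the keep probability,
# so each unit sees roughly the same total input as it did during training.
W_after_removal = W * (1.0 - drop_rate)  # i.e. halve them for a rate of 0.5
```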

That is the reasoning, and it holds assuming dropout is implemented as in what I believe was the original paper on it. However, it turns out that this is not exactly how dropout is implemented in Keras. Because Keras rescales the surviving activations at train time (so-called inverted dropout), no further changes are needed if you increase or decrease the dropout rate. Quite convenient if you have a need for moving weights around - you simply don't have to worry about the rescaling.
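For reference, this is roughly what inverted dropout looks like (a minimal numpy sketch, not Keras's actual source): the surviving activations are scaled up by 1 / (1 - rate) during training, so at test time the layer is used as-is, with no weight rescaling.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Training-time dropout that needs no test-time rescaling."""
    mask = rng.random(x.shape) >= rate   # keep each activation with probability 1 - rate
    return x * mask / (1.0 - rate)       # scale survivors up so the expected value is unchanged

rng = np.random.default_rng(0)
print(inverted_dropout(np.ones(8), 0.5, rng))   # zeros and 2.0s; mean stays ~1.0
```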

I experimented with this a little bit, and it took me a while to get to the bottom of it - if you would like to read about it a bit more, here is the original thread.
