# Optimising dropout

Jeremy mentioned using intuition a few times today. When designing/modifying a model, is there a way to optimise the dropout rate used? Is there any reason not to train a suite of models on a sample training set, all with different dropout rates to try and find the optimum?
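That experiment can be run as a simple sweep: train one model per candidate rate and keep the one with the best validation score. A minimal sketch of the harness; `train_and_evaluate` here is a hypothetical stand-in (in practice it would build a model with the given dropout rate, fit it on the sample training set, and return validation accuracy):

```python
def train_and_evaluate(dropout_rate):
    """Hypothetical stand-in for a real training run.
    The toy score below happens to peak at 0.5 -- purely illustrative."""
    return 1.0 - abs(dropout_rate - 0.5)

# Candidate dropout rates to compare.
rates = [0.0, 0.1, 0.25, 0.5, 0.75]
scores = {r: train_and_evaluate(r) for r in rates}
best_rate = max(scores, key=scores.get)
print(best_rate)
```

The same loop works with any model-building code; only `train_and_evaluate` needs to be replaced with real training and validation.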


That's exactly the right approach!


I have another question about dropout.

I was under the impression that dropout works a bit like an ensemble method, where the data isn't thrown away for good, but instead is seen from different perspectives.

I thought it went like this: say you've got 10 inputs and you set dropout to 0.5 at each layer.

Input layer: all 10 inputs go in
Next layer: maybe it sees, say, input numbers 1, 2, 3, 4, and 9
Next layer: maybe it sees, say, inputs 1, 2, 7, 8, and 9

Is this right? Or is all the data in fact available at all times?

If not, it seems like not only are you losing a tremendous amount of data extremely quickly, but the layers aren't seeing random subsets of the full data, which is what you'd want in something like a random forest.

At 1:27:22 in lecture 3, Jeremy says, "If you use dropout on an early layer, then you are losing that information for all the future layers." This makes me think I've got it wrong.

But then the comment on this question makes me think I had it right!

EDIT!

At 1:31:09, Jeremy says, "Dropout is effectively automatically creating lots of neural networks, with different subsets of features that have been randomly selected."

So maybe I've got it wrong again: this sounds like we are taking what was once a single top-to-bottom DL network and turning it into more than one network?

The way it's actually implemented is that, after each layer's activation function is calculated, a binary random variable is drawn for every activation, and the output is multiplied by it. So, if there's p=0.5 dropout after the 1st layer, then ~50% of the activations will be multiplied by zero (effectively removing them).
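That mechanism is a few lines of NumPy: draw one Bernoulli variable per activation and multiply it in. The values below are illustrative; the "inverted dropout" rescaling shown at the end is the common variant used by frameworks like Keras, so that no scaling is needed at test time:

```python
import numpy as np

rng = np.random.RandomState(0)
p = 0.5                                    # dropout probability
acts = np.array([0.2, 1.5, -0.3, 0.8, 2.1])  # illustrative activations

# One binary random variable per activation: keep with probability 1 - p.
mask = rng.binomial(1, 1.0 - p, size=acts.shape)
dropped = acts * mask                      # zeroed units are "removed" this pass

# Inverted-dropout variant: rescale the survivors so the expected
# activation magnitude is unchanged at training time.
scaled = dropped / (1.0 - p)
```

On average about half the entries of `dropped` are exactly zero; the rest are the original activations, passed through unchanged.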

The idea that it approximates an ensemble method comes from this paper. Take a look and see if you find the argument compelling. It's more of an analogy or equivalence: there isn't an actual genuine ensemble being created.

Perhaps the best way to understand it is to read the code - it's (as is normal in DL) very straightforward: https://github.com/fchollet/keras/blob/master/keras/backend/theano_backend.py#L1104


Oh, so the weight isn't forgotten; it's just that the output from that neuron is ignored/set to 0 this time around. Next time the weight can still be updated.

That makes sense - I was thinking the weight gets forgotten completely.

So does dropout mean that we are neglecting only the output of that particular neuron, and not the information stored in its weights?

Yes to you both!
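A tiny NumPy sketch of that point: dropout zeroes a unit's output (and hence the gradient flowing back through it) for one pass, but the stored weights are untouched and can still update on later passes. All names and values here are illustrative:

```python
import numpy as np

rng = np.random.RandomState(1)
w = rng.randn(3, 4)              # incoming weights of a 4-unit layer
x = rng.randn(3)                 # one input example
w_before = w.copy()

pre = x @ w                      # pre-activations (identity activation for brevity)
mask = rng.binomial(1, 0.5, size=pre.shape)
out = pre * mask                 # dropped units output zero this pass

# Backprop through the mask: dropped units pass no gradient back, so
# their incoming weights receive a zero update this step...
grad_out = np.ones_like(out)     # pretend upstream gradient
grad_pre = grad_out * mask
grad_w = np.outer(x, grad_pre)
# ...but the weight matrix itself was never erased -- w equals w_before.
```

On the next pass a fresh mask is drawn, so a unit dropped this time will usually participate (and update) next time.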

@jeremy I have two practical questions on dropout and wanted to get your opinion:

1- Dataset size: In the original dropout paper (section 7.4) they report that for small datasets (100-500 observations), dropout does not give any improvement. I tried a dropout feed-forward architecture for the Don't Overfit competition on Kaggle, thinking it would be a great fit for reducing dependency on uninformative features and avoiding overfitting.
However, despite trying different Dense architectures and p values, I observed that including dropout (e.g. p=0.2 on the input / p=0.5 after the hidden layers) made things worse. Do you think this is due to the small size (250 observations) of the labelled training data?

From your experience, is this kind of problem (nearly as many features as training examples) in general a poor fit for neural nets?

2- Max-out networks: dropout is reported to work really well with max-out layers. I realize this is something we have not covered in the lecture series. Is it because max-out layers are now outdated? Have you had much experience with them? Any pointers?

I haven't tried dropout on datasets that small, so I can't really comment there.

I asked Ian Goodfellow (the author of the paper) recently about maxout, and he said he still thinks it has potential, even though it hasnâ€™t been used much.

I haven't seen it used in successful Kaggle solutions, however, which I don't think is a good sign. I tend to stick with stuff that wins competitions, since that's where we see state-of-the-art results pretty much every time.
