Jeremy mentioned using intuition a few times today. When designing/modifying a model, is there a way to optimise the dropout rate used? Is there any reason not to train a suite of models on a sample training set, all with different dropout rates to try and find the optimum?
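For what it's worth, the sweep described in this question could be sketched roughly like this. It is only an illustration: `sweep_dropout_rates` and `toy_train_and_validate` are hypothetical names, and the toy function is a made-up placeholder standing in for "train a model on the sample set with dropout rate p and return its validation loss". It is not a real result.

```python
def sweep_dropout_rates(rates, train_and_validate):
    """Train one model per dropout rate; return (best_rate, all_losses)."""
    # One independent training run per candidate rate.
    losses = {p: train_and_validate(p) for p in rates}
    # Pick the rate with the lowest validation loss.
    best_rate = min(losses, key=losses.get)
    return best_rate, losses


def toy_train_and_validate(p):
    # PLACEHOLDER (assumption): a made-up curve so the sketch runs on its own.
    # In practice this would build the model with dropout rate p, fit it on
    # the sample training set, and return the loss on held-out data.
    return (p - 0.4) ** 2 + 0.1


candidate_rates = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
best_rate, losses = sweep_dropout_rates(candidate_rates, toy_train_and_validate)
print(best_rate, losses[best_rate])
```

The main cost is one full training run per candidate rate, and the best rate found on a small sample may not carry over to the full dataset.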
I have another question about dropout.
I was under the impression that dropout works a bit like an ensemble method, where the data isn’t thrown away for good, but instead is seen from different perspectives.
I thought it went like this: say you’ve got 10 inputs and you set dropout to 0.5 at each layer.
Input layer: all 10 inputs go in
Next layer: maybe it sees, say, input numbers 1, 2, 3, 4, and 9
Next layer: maybe it sees, say, input number 1, 2, 7, 8, and 9
Is this right? Or is it not the case that all the data is available at all times?
If not, it seems like not only are you losing a tremendous amount of data extremely quickly, but the layers aren’t seeing a subset of random permutations of the full data, which is like what you’d want in something like a random forest.
At 1:27:22 in lecture 3, Jeremy says, “If you use dropout on an early layer, then you are losing that information for all the future layers.” This makes me think I’ve got it wrong.
But then the comment on this question makes me think I had it right!
EDIT!
At 1:31:09, Jeremy says, “Dropout is effectively automatically creating lots of neural networks, with different subsets of features that have been randomly selected.”
So maybe I’ve got it wrong again – this sounds like we are taking what was once one top to bottom DL network, and now we have more than one network??
The way it’s actually implemented is that, after each layer’s activation function is calculated, a binary random variable is drawn for every activation, and the output is multiplied by that. So, if there’s p=0.5 dropout after the 1st layer, then ~50% of the activations will be multiplied by zero (effectively removing them).
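A minimal sketch of that description in plain Python, in case it helps. The function name `dropout` and its arguments are just for illustration. One thing not mentioned above, so treat it as an assumption: the surviving activations are divided by the keep probability (often called "inverted dropout"), so the expected activation stays the same and nothing needs rescaling at test time.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; rescale the survivors."""
    if not training or p == 0.0:
        # At test time dropout is a no-op.
        return list(activations)
    keep_prob = 1.0 - p
    out = []
    for a in activations:
        # Binary random variable: 1 with probability keep_prob, else 0.
        mask = 1.0 if rng.random() < keep_prob else 0.0
        # Assumption: "inverted dropout" scaling by 1 / keep_prob.
        out.append(a * mask / keep_prob)
    return out


rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, p=0.5, rng=rng)
zero_fraction = sum(1 for a in dropped if a == 0.0) / len(dropped)
print(f"fraction zeroed: {zero_fraction:.3f}")  # close to 0.5 for p=0.5
```

A fresh mask is drawn on every forward pass, so a different random subset of activations is zeroed each time.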
The idea that it approximates an ensemble method comes from this paper. Take a look and see if you find the argument compelling. It’s more of an analogy or equivalence - there isn’t an actual genuine ensemble being created.
Perhaps the best way to understand it is to read the code - it’s (as is normal in DL) very straightforward: https://github.com/fchollet/keras/blob/master/keras/backend/theano_backend.py#L1104
Oh so the weight isn’t forgotten, it’s just that the output from that neuron is ignored/set to 0 this time around. Next time the weight can still be updated.
That makes sense - I was thinking the weight gets forgotten completely.
So does dropout mean that we are ignoring only the output of that particular neuron, and not the information stored in its weights?
Yes to you both!
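To make that concrete, here is a toy sketch (not from the lecture) of a single linear output unit trained with squared error. The names and numbers are made up for illustration, and the 1/(1-p) rescaling is left out to keep it short. The gradient for a weight is multiplied by its unit's mask, so a dropped unit's weight is simply left alone for that step and can be updated on a later step when the unit is kept.

```python
def sgd_step_with_dropout(weights, hidden, target, mask, lr=0.1):
    """One SGD step on 0.5 * (pred - target)**2 with a fixed dropout mask."""
    # Forward pass: dropped units (mask 0) contribute nothing to the output.
    pred = sum(w * m * h for w, m, h in zip(weights, mask, hidden))
    error = pred - target
    # d(loss)/d(w_i) = error * m_i * h_i, which is zero for dropped units,
    # so their weights keep their current values.
    return [w - lr * error * m * h for w, m, h in zip(weights, mask, hidden)]


weights = [0.5, -0.3, 0.8, 0.1]
hidden = [1.0, 2.0, 0.5, 1.5]

# Step 1: units 1 and 3 are dropped, so only weights 0 and 2 move.
after_step1 = sgd_step_with_dropout(weights, hidden, 1.0, mask=[1, 0, 1, 0])

# Step 2: a different mask; the previously dropped weights now get updated.
after_step2 = sgd_step_with_dropout(after_step1, hidden, 1.0, mask=[0, 1, 0, 1])

print(after_step1)
print(after_step2)
```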
@jeremy I have two practical questions on dropout and wanted to get your opinion:
1- data set size: In the original dropout paper (section 7.4) they report that for small datasets (100-500 obs), dropout does not give any improvement. I tried a feed-forward architecture with dropout for the Don’t Overfit competition on Kaggle, thinking it would be a great fit for reducing dependency on uninformative features and avoiding overfitting.
However, despite trying different Dense architectures and p values, I observed that including Dropout (e.g. p=0.2 on the input / p=0.5 after hidden layers) made things worse. Do you think this is due to the small size (250 obs) of the labelled training data?
In your experience, is this kind of problem (nearly as many features as training examples) generally not a good fit for neural nets?
2- max-out networks: dropout is presented to work really well with max-out layers. I realize this is something we have not covered in the lecture series. Is it because max-out layers are now outdated? Have you had much experience with them? Any pointers?
Thanks a lot in advance.
I haven’t tried dropout on datasets that small, so can’t really comment there.
I asked Ian Goodfellow (the author of the paper) recently about maxout, and he said he still thinks it has potential, even though it hasn’t been used much.
I haven’t seen it used in successful Kaggle solutions, however, which I don’t think is a good sign. I tend to stick with stuff that wins competitions, since that’s where we see state-of-the-art results pretty much every time.
I had a related question about dropout optimization:
In the lesson 3 notebook, we remove dropout to get to a point where we overfit, then we add it again to reduce overfitting and improve performance. Is this just part of the investigation process to get to the optimal dropout rate? Or should we also follow the same process of removing then re-adding dropout when generating the final models at the ensembling phase?
In other words, once we have found the optimal dropout rate, can it harm the training of a new model to set the final dropout rate before having reached overfitting?
I am reviving this topic because I just found a great paper from 2017 on a method to optimize the dropout rate on the fly: concrete-dropout
As I understand it, implementation-wise, it requires using their dropout layers (the main author provided a base implementation in PyTorch) and adding a regularisation term to the loss.
Having the dropout optimizing itself during the training seems in line with the fastai approach.
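For anyone curious why the rate becomes trainable, here is a rough plain-Python sketch of the relaxed mask as I understand it from the paper. This is not the authors' code: `concrete_dropout_mask` is a hypothetical name, the temperature of 0.1 is an assumption, and the regularisation term is left out entirely. The idea is that the hard 0/1 mask is replaced by a smooth function of the dropout probability p and uniform noise, so gradients can flow back to p.

```python
import math
import random


def concrete_dropout_mask(p, n, rng, temperature=0.1, eps=1e-7):
    """Sample n soft 'keep' values in [0, 1] for dropout probability p."""
    keep = []
    for _ in range(n):
        u = rng.random()  # uniform noise in [0, 1)
        # Smooth stand-in for "drop this unit with probability p".
        logit = (math.log(p + eps) - math.log(1.0 - p + eps)
                 + math.log(u + eps) - math.log(1.0 - u + eps))
        drop = 1.0 / (1.0 + math.exp(-logit / temperature))  # sigmoid
        keep.append(1.0 - drop)
    return keep


rng = random.Random(0)
mask_half = concrete_dropout_mask(0.5, 10000, rng)
mask_low = concrete_dropout_mask(0.1, 10000, rng)
mask_high = concrete_dropout_mask(0.9, 10000, rng)

mean_half = sum(mask_half) / len(mask_half)
mean_low = sum(mask_low) / len(mask_low)
mean_high = sum(mask_high) / len(mask_high)
print(mean_half, mean_low, mean_high)
```

In a real layer the activations would be multiplied by these keep values and divided by 1 - p, with p stored as a learnable parameter; the authors' implementation is the place to look for the details.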