@rachel do you mean it will be zero if u < .5 and 1 if u > .5? Otherwise both read as p = .5
Ctrl+b, z
Right, so instead of reducing the weights at test time you make them bigger at training time, if I understood your explanation?
Aren’t categorical columns already sparse? Applying dropout may risk the model not learning enough for those embeddings.
I was saying that it will be 0 with probability 50% (which is equivalent to choosing a u~uniform(0,1) and checking if u < .5, which is what you are saying now).
Yes. In PyTorch (unless I remember wrong) dropout doesn’t do anything to your weights at all during test time.
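To make the “scale at training time, do nothing at test time” point concrete, here is a minimal pure-Python sketch of inverted dropout (the scheme PyTorch uses); the function name and list-based interface are illustrative, not any library’s API:

```python
import random

def inverted_dropout(xs, p, training):
    """Inverted dropout: zero each activation with probability p at
    training time and scale survivors by 1/(1-p), so test time is a no-op."""
    if not training:
        return list(xs)  # test time: activations pass through unchanged
    return [0.0 if random.random() < p else x / (1 - p) for x in xs]

random.seed(0)
acts = [1.0, 2.0, 3.0, 4.0]
print(inverted_dropout(acts, p=0.5, training=True))   # -> [2.0, 4.0, 0.0, 0.0]
print(inverted_dropout(acts, p=0.5, training=False))  # -> [1.0, 2.0, 3.0, 4.0]
```

Because survivors are already divided by (1 - p) during training, the expected activation is unchanged and nothing needs to be rescaled at test time.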
In what proportion would you use dropout vs. other regularization methods, like weight decay, L2 norms, etc.?
Balancing them all properly is the art of a good practitioner. Sadly we don’t have any guidance on that except to build your intuition for it.
Can anyone explain embedding dropout… is it different from normal dropout? I missed this part…
Use RMSE if you want to minimize absolute error.
Use RMSPE if you want to minimize fractional error.
In many cases, we are using whichever loss was used by the Kaggle competition or academic benchmark that we are comparing our results against.
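As a quick sketch of the absolute-vs-fractional distinction, here are both metrics in plain Python (function names are mine, not a library’s):

```python
import math

def rmse(y_true, y_pred):
    # root mean squared error: penalizes absolute differences
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmspe(y_true, y_pred):
    # root mean squared *percentage* error: penalizes fractional differences
    return math.sqrt(sum(((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [100.0, 10.0]
y_pred = [110.0, 11.0]  # both predictions are 10% too high
print(rmse(y_true, y_pred))   # ~7.11, dominated by the larger target
print(rmspe(y_true, y_pred))  # 0.1, since each error is exactly 10%
```

Under RMSE the error on the large target dwarfs the small one; under RMSPE both count equally, which is why percentage metrics suit targets that span orders of magnitude.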
It just drops out some activations of the embedding. Remember that you can treat an embedding lookup as a multiplication of a matrix of one-hot encoded vectors by the embedding matrix. So after getting the embedding, you lose some of the embedding values for each feature.
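A toy sketch of that idea, assuming inverted-dropout scaling; the matrix, names, and interface here are illustrative, not fastai’s actual API:

```python
import random

# Tiny embedding matrix: one row per category value.
emb_matrix = [
    [0.1, 0.2, 0.3, 0.4],   # embedding for category 0
    [0.5, 0.6, 0.7, 0.8],   # embedding for category 1
]

def embed_with_dropout(cat_idx, p, training):
    """Look up an embedding row (same result as one-hot vector @ emb_matrix),
    then zero some of its individual values, as embedding dropout does."""
    vec = emb_matrix[cat_idx]
    if not training:
        return list(vec)  # no dropout at test time
    # drop individual embedding values, scaling survivors by 1/(1-p)
    return [0.0 if random.random() < p else v / (1 - p) for v in vec]
```

So the mechanism is ordinary dropout; what makes it “embedding dropout” is only that it is applied to the embedding outputs rather than to a hidden layer’s activations.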
In a two-class problem, say classification of cat or dog,
our model will assign a probability of being a cat (Jeremy calls it “cattyness”)
and a probability of being a dog (“doggyness”). Of course since there are only two classes, the class label is either 1 for cat or 0 for dog. And cattyness = 1 - doggyness.
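A quick numeric sketch of why cattyness = 1 - doggyness, assuming the two scores come from a two-class softmax (the function name is mine):

```python
import math

def softmax2(logit_cat, logit_dog):
    """Two-class softmax: the two probabilities necessarily sum to 1."""
    e_cat, e_dog = math.exp(logit_cat), math.exp(logit_dog)
    z = e_cat + e_dog
    return e_cat / z, e_dog / z

cattyness, doggyness = softmax2(2.0, 0.5)
print(round(cattyness + doggyness, 10))  # 1.0, so doggyness = 1 - cattyness
```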
thank you
How is it different from the normal dropout parameter p?
Thank you. Any intuitions (like in what situations) when minimizing absolute is better than percentage and vice versa?
It applies to the embeddings instead of the hidden layers. The difference is in which layers are affected.
I’m not so sure that dropping inputs is a bad idea! It is akin to building trees from random subsets of the features in a Random Forest.
Could we get a quick overview of weight norm? (I think that’s what Jeremy said)
Yes, you only need to use dropouts to regularize your model.
That’s advanced (which is why he didn’t go through the explanation)