Weight Decay is different from Learning Rate / No of Epochs.
It’s a regularization coefficient (part of a regularization term) that we add to the loss as a penalty on the model. This penalty’s job is to control the complexity of the model (the size of the coefficients, or in some cases the number of coefficients) so the model doesn’t become over-complex and overfit (memorize) the training data.
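For anyone who finds code clearer than prose, here’s a minimal sketch of the idea (PyTorch-style, with `model`, `x`, `y` and `loss_fn` assumed to already exist, not the lesson’s actual code):

```python
wd = 1e-4                       # weight decay coefficient (hyperparameter)

preds = model(x)
data_loss = loss_fn(preds, y)

# L2 penalty: wd times the sum of squared weights, added on top of the data loss.
# Bigger weights -> bigger penalty, which nudges the model towards simpler solutions.
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = data_loss + wd * l2_penalty
loss.backward()
```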
Or no need for an AWS p3
any more likes to this…
This can be a very useful thing where you have a lot of noisy background that isn’t of interest and only a very small area of pixels that is… just throwing out my thought, not sure if it can be done or not, so seeking clarifications.
Arnold needed to come to life. He expects nothing less than a p3.
How is the wd parameter related to the L1/L2 regularization norm? From the formula that was shown, it looks like the coefficient is multiplied by the L2 norm, right?
I have a weight decay question…
Does the amount of weight decay increase proportionally to the size of the network? i.e. will a network with more parameters have more weight decay, and do you need to think about that when you specify it?
PCA can be applied to image data. There are some examples in the computational linear algebra course, for instance, using PCA to identify the foreground vs. background in a video.
No, it doesn’t. Weight decay applies to each individual weight during the update, so it is agnostic to the number of parameters.
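A tiny sketch of why (plain SGD, purely illustrative):

```python
# Each weight is updated independently:
#   w <- w - lr * (grad + wd * w)
# A network with 10x more parameters just applies the same per-weight rule to more weights;
# no individual weight gets decayed more strongly.
def sgd_step_with_wd(w, grad, lr=0.01, wd=1e-4):
    return w - lr * (grad + wd * w)
```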
That’s not the same thing, it’s a parameter that reduces the learning rate over time.
A larger batch will yield more accurate (less noisy) weight updates. The tradeoff is you get fewer weight updates per epoch. On the other hand, a small batch will yield noisier updates, but you get more updates per epoch.
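Quick back-of-the-envelope illustration (numbers made up):

```python
import math

dataset_size = 10_000                     # illustrative, not from the lesson
for batch_size in (16, 64, 256):
    updates = math.ceil(dataset_size / batch_size)
    print(f"batch_size={batch_size:>3} -> {updates} updates per epoch")
# Larger batches give fewer (but less noisy) updates per epoch; smaller batches the opposite.
```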
wd plays the same role in both the L1 and L2 regularization norms. The difference is in how the weights are summed (it’s this sum that gets multiplied by wd): L1 sums the absolute values of the weights, while L2 sums the squared values of the weights.
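In code, the two penalties look like this (a sketch, assuming `params` is an iterable of PyTorch tensors and `wd` is the weight decay coefficient):

```python
def l1_penalty(params, wd):
    # L1: wd * sum of the absolute values of the weights
    return wd * sum(p.abs().sum() for p in params)

def l2_penalty(params, wd):
    # L2: wd * sum of the squared weights
    return wd * sum((p ** 2).sum() for p in params)

# loss = data_loss + l1_penalty(model.parameters(), wd)   # L1 regularization
# loss = data_loss + l2_penalty(model.parameters(), wd)   # L2 / weight decay
```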
That’s learning rate decay, which is different from weight decay. Learning rate decay reduces the learning rate over several epochs. Weight decay affects the update of the parameters themselves during training.
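A toy example of the difference (all values made up, plain SGD on a single weight):

```python
w, grad = 1.0, 0.5        # one weight and its (fixed, pretend) gradient
base_lr, wd = 0.1, 1e-4

for epoch in range(3):
    lr = base_lr * (0.9 ** epoch)    # learning rate decay: the step size shrinks over epochs
    w = w - lr * (grad + wd * w)     # weight decay: every update also pulls w towards zero
    print(epoch, lr, w)
```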
Okay. Thanks for clarifying the difference.
Does something like “differential weight decay”, the way we do differential learning rates, make any sense?
Definitely. Leslie Smith is working on it, please check my previous post’s link.
nice! this is the type of explanation i was looking for!
“Fewer weight updates per epoch”, thus needing more epochs, right? As @sgugger said?
EDIT: punctuation
Thanks for clarifying the difference between weight decay and learning rate decay.