I noticed that in the Learner class, when true_wd==True, wd means weight decay. How do I set L2 regularization in that case?
My understanding of AdamW is that it separates weight decay and L2 regularization, so they are two different parameters in AdamW. Correct me if I’m wrong.
Thanks a lot!
By default you can only use one or the other (with true_wd set to True or False). If you want to use both at the same time, you should subclass.
Thanks @sgugger. I’m not sure if wd can replace l2. Is it useful to tune both of them simultaneously?
I have no idea, I thought that was your question.
If they are two different parameters, both of them should be in the function signature. As it is, they are exclusive. I need to read the paper to find out why. Thanks!
You may want to read this, which explains why there was a need to separate out L2 regularization (which was the standard way to regularize with classic SGD or batch GD with batch size > 1). Weight decay and L2 regularization are one and the same thing as long as we use classic SGD, but when we use more sophisticated optimizers like Adam or SGD with momentum there is a subtle difference. Read the link; it also links to a fastai blog post explaining these things.
In short, we take the L2/weight decay term out of (decouple it from) the moving-average calculation that happens during backpropagation, and instead add this term right at the moment we are ready to perform the step, i.e. the weight update:
w = w - lr * (moving_avg_grad + wd * w)
Experts can correct me if I’m making any mistake.
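To make the point above concrete, here is a small numeric sketch (not fastai code; the helper names and the simplified "adaptive" step are mine, and the adaptive step omits Adam's momentum and bias correction). It shows that for plain SGD the two formulations coincide, while with an adaptive denominator the L2-in-gradient version gets rescaled and decoupled weight decay does not:

```python
import math

def sgd_l2_step(w, grad, lr, l2):
    # L2 regularization: fold l2 * w into the gradient before stepping
    return w - lr * (grad + l2 * w)

def sgd_wd_step(w, grad, lr, wd):
    # Decoupled weight decay: take the gradient step, then shrink w directly
    return w - lr * grad - lr * wd * w

def adaptive_step(w, grad, lr, v):
    # Toy stand-in for Adam's update: gradient divided by an RMS-style term v
    return w - lr * grad / (math.sqrt(v) + 1e-8)

w, g, lr, wd = 2.0, 0.5, 0.1, 0.01

# With plain SGD the two formulations give the exact same update:
assert abs(sgd_l2_step(w, g, lr, wd) - sgd_wd_step(w, g, lr, wd)) < 1e-12

# With an adaptive optimizer they differ: the L2 term is divided by the
# adaptive denominator, while the decoupled wd term is not.
w_l2 = adaptive_step(w, g + wd * w, lr, v=4.0)
w_wd = adaptive_step(w, g, lr, v=4.0) - lr * wd * w
assert abs(w_l2 - w_wd) > 1e-6
```

This is why tuning an extra L2 term on top of decoupled wd is usually redundant for SGD, but genuinely changes the update for Adam-style optimizers.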