How to set L2 regularization in AdamW?

heavywater · January 26, 2019, 2:04pm

I noticed that in the Learner class, when true_wd=True, wd means weight decay. How do I set L2 regularization in that case?

My understanding of AdamW is that it separates weight decay from L2 regularization, so they are two different parameters in AdamW. Correct me if I’m wrong.

Thanks a lot!

sgugger · January 26, 2019, 7:59pm

By default you can only use one or the other (with true_wd=True or False). If you want to use both at the same time, you should subclass OptimWrapper.

heavywater · January 27, 2019, 2:46am

Thanks @sgugger. I’m not sure whether wd can replace L2. Is it useful to tune both of them simultaneously?

sgugger · January 27, 2019, 4:08am

I have no idea, I thought that was your question.

heavywater · January 28, 2019, 3:26am

If they are two different parameters, both of them should be exposed in the function. As it stands, they are mutually exclusive. I need to read the paper to figure this out. Thanks!

champs.jaideep · April 12, 2019, 6:41am

You may want to read this… which explains why weight decay needed to be separated from L2 regularization (the usual way to regularize with classic SGD or batch gradient descent with batch > 1). Weight decay and L2 regularization are one and the same as long as we use classic SGD, but with more sophisticated optimizers like Adam and SGD with momentum there is a subtle difference. Follow the link; it also links to the fastai blog post explaining these things.
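
To make the classic-SGD case concrete, here is a minimal sketch in plain Python (not fastai code). The function names `sgd_l2_step` and `sgd_decay_step` and the numbers are made up for illustration. It assumes the L2 penalty is written as 0.5 * coeff * w**2, so its gradient is coeff * w.

```python
def sgd_l2_step(w, grad, lr, l2):
    # L2 regularization: the penalty 0.5 * l2 * w**2 adds l2 * w to the gradient
    return w - lr * (grad + l2 * w)


def sgd_decay_step(w, grad, lr, wd):
    # Weight decay: plain gradient step, plus shrinking the weight directly
    return w - lr * grad - lr * wd * w


# Hypothetical values, chosen only for illustration
w, grad, lr, coeff = 1.5, 0.2, 0.1, 0.01
a = sgd_l2_step(w, grad, lr, coeff)
b = sgd_decay_step(w, grad, lr, coeff)
print(a, b)  # the two updates agree up to floating-point rounding
```

With vanilla SGD the two updates are the same expression rearranged, which is why a single parameter was enough there.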

In short, we take the weight decay term out of (decouple it from) the moving-average calculation, which is applied to the gradients coming from backpropagation, and add it back right when we perform the step, that is, the weight update:
w = w - lr * (moving_avg(g) + wd * w)
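
The update above can be sketched for a single scalar weight as follows. This is a toy illustration under my own assumptions, not the fastai or PyTorch implementation: `adam_step`, `first_step`, and the `l2` / `wd` arguments are hypothetical names. Here `l2` is folded into the gradient before the moving averages (coupled), while `wd` is applied only at the step (decoupled), so both can be set at once.

```python
import math


def adam_step(w, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              l2=0.0, wd=0.0):
    """One Adam update on a scalar weight (toy sketch, hypothetical API)."""
    beta1, beta2 = betas
    # Coupled L2: the penalty gradient goes through the moving averages
    g = grad + l2 * w
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    # Bias-corrected moving averages
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    adam_dir = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled weight decay: added only here, at the weight update
    return w - lr * (adam_dir + wd * w)


def first_step(l2=0.0, wd=0.0):
    # Fresh optimizer state, starting weight 1.0, gradient 0.5 (made-up values)
    state = {"t": 0, "m": 0.0, "v": 0.0}
    return adam_step(1.0, 0.5, state, l2=l2, wd=wd)


w_plain = first_step()
w_l2 = first_step(l2=0.1)
w_wd = first_step(wd=0.1)
print(w_plain, w_l2, w_wd)
```

In this toy setup the L2 term barely changes the first step, because Adam divides the gradient by its own magnitude, while the decoupled term shrinks the weight by an extra lr * wd * w. That is the subtle difference between the two.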

Experts, please correct me if I have made any mistakes.