Weight decay implementation

Hello,

Going deeper into the fast.ai library, I am looking at the fit methods, and especially the optimizers, callbacks and schedules that are behind a lot of fast.ai's performance and ease of use.

I have a question regarding weight decay. I see that, as part of the stepper, weight decay is manually applied to the weights after the gradient computation:

if 'wd' in self.opt.param_groups[0] and self.opt.param_groups[0]['wd'] != 0:
    # Weight decay out of the loss. After the gradient computation but before the step.
    for group in self.opt.param_groups:
        lr, wd = group['lr'], group['wd']
        for p in group['params']:
            if p.grad is not None: p.data = p.data.add(-wd * lr, p.data)

Isn't this causing the weight decay to be applied twice? When a standard PyTorch optimizer is used (e.g. Adam), the step() method already performs weight decay (albeit with a different calculation).
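
For reference, here is my rough understanding of the two calculations, as a simplified standalone sketch (not the actual PyTorch or fast.ai source; the tensors and hyperparameters are made up for illustration):

import torch

p = torch.randn(3, requires_grad=True)
p.grad = torch.randn(3)  # pretend these gradients came from a backward pass
lr, wd = 0.1, 0.01       # made-up hyperparameters

# "Coupled" decay, roughly what weight_decay does inside a standard PyTorch
# optimizer: the decay term is folded into the gradient before the update,
# so it also flows through the momentum / Adam statistics.
grad = p.grad + wd * p.data
p.data -= lr * grad      # plain SGD update shown for simplicity

# "Decoupled" decay, what the stepper snippet above does: shrink the weights
# directly, after the gradients are computed but before the optimizer step.
p.data.add_(p.data, alpha=-lr * wd)   # p <- p - lr * wd * p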

Probably a stupid mistake on my side; clarifications are very welcome!

Thanks

Hi there.

This is the way we picked to apply weight decay when we don't want it done inside the optimizer (see here for why). The optimizer has a key named 'weight_decay', not 'wd'. So at this step, we check whether there is an extra key 'wd' and apply the weight decay before the optimizer messes with it.

Of course, it should be applied with 'weight_decay' set to 0, otherwise weight decay would be applied twice, as you mention.
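
To make that concrete, here is a minimal standalone sketch of the pattern (illustrative only, not the fastai code; the model, data and hyperparameters are invented): the optimizer's built-in weight_decay is set to 0, and the decay is applied by hand between backward() and step(), so it only happens once.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative setup: a toy model and optimizer with built-in decay disabled.
model = nn.Linear(10, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0)  # built-in decay off
wd = 1e-2  # our own decay factor, kept outside the optimizer's 'weight_decay' key

def train_step(x, y):
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Apply weight decay directly to the weights, after the gradient computation
    # but before the step (same idea as the stepper snippet above).
    for group in opt.param_groups:
        lr = group['lr']
        for p in group['params']:
            if p.grad is not None:
                p.data.add_(p.data, alpha=-lr * wd)
    opt.step()
    return loss.item()

x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
print(train_step(x, y))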

(I’ve moved this from #fastai-dev, since that category is just for dev of the new v1 library.)