Is weight decay applied to the bias term?

wgpubs · June 13, 2020, 1:27am

Was looking at some of the huggingface optimizers/schedulers and noticed that they use parameter groups to exclude both LayerNorm weights and all bias terms from weight decay.

Does this happen in v2? Either way, would be curious to know the rationale for applying it or not applying it to the bias term.

Thanks

Pomo · June 13, 2020, 11:34pm

I thought this would be an easy question, penalizing the norm of weights, blah blah blah. But it turns out to be a rabbit hole. Good luck!

muellerzr · June 14, 2020, 12:38am

Here’s a bit I’ve found:

To apply different hyper-parameters to different groups (differential learning rates, or no weight decay for certain layers for instance), you will need to adjust those values after the init.

Otherwise, I believe you can?

Here’s another bit I’ve read:

Parameters such as batchnorm weights/bias can be marked to always be in training mode, just put force_train=True in their state.

With the tests:

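# Three parameter groups of four parameters each
params = [tst_params(), tst_params(), tst_params()]
opt = Optimizer(params, sgd_step, lr=0.1)
# Mark the 2nd and 4th parameters of the middle group as always trainable
for p in L(params[1])[[1,3]]: opt.state[p] = {'force_train': True}
# Freeze everything except the last group
opt.freeze()
test_eq(L(params[0]).map(req_grad), [False]*4)
test_eq(L(params[1]).map(req_grad), [False, True, False, True])  # force_train params stay trainable
test_eq(L(params[2]).map(req_grad), [True]*4)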

Maybe this is a place to start looking? (This is 12_optimizer)

And here is exactly what you want I think:

def create_opt(self):
    self.opt = self.opt_func(self.splitter(self.model), lr=self.lr)
    if not self.wd_bn_bias:
        for p in self._bn_bias_state(True): p['do_wd'] = False

Learner has a parameter called wd_bn_bias=False, along with train_bn=True.

(This one was found in Learner)
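
The idea behind the do_wd flag can be sketched without fastai. The following is a simplified stand-in, not fastai's actual code: `Param`, `weight_decay_step`, `sgd_update`, and `step` are hypothetical names, and plain floats play the role of tensors. Each parameter carries its own state dict, and the decay step is skipped when that state sets do_wd to False.

```python
from dataclasses import dataclass, field


@dataclass
class Param:
    """Hypothetical stand-in for a tensor: a value, its gradient, and per-parameter state."""
    value: float
    grad: float = 0.0
    state: dict = field(default_factory=dict)


def weight_decay_step(p, lr, wd):
    # Decoupled weight decay: shrink the value toward zero unless this
    # parameter has opted out through its state.
    if p.state.get("do_wd", True) and wd != 0:
        p.value *= 1 - lr * wd


def sgd_update(p, lr, wd):
    # Plain gradient descent update; wd is unused here but kept so that
    # every stepper shares the same signature.
    p.value -= lr * p.grad


def step(params, steppers, lr, wd):
    # Apply every stepper, in order, to every parameter.
    for p in params:
        for stepper in steppers:
            stepper(p, lr, wd)


weight = Param(value=2.0)
bias = Param(value=2.0, state={"do_wd": False})  # the bias opts out of decay

step([weight, bias], [weight_decay_step, sgd_update], lr=0.1, wd=0.5)
print(weight.value, bias.value)
```

With zero gradients, only the weight shrinks and the bias keeps its value. Because the opt-out lives in per-parameter state, it does not need a parameter group of its own.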

I’m not 100% sure if that’s what you’re looking for, trying to understand the rabbit hole myself, but these should be the related bits

wgpubs · June 14, 2020, 1:25am

Yah … funny, I was just going through this same code about an hour or so ago.

Whereas the huggingface snippet uses parameter groups to isolate the bias parameters, fastai appears to do this via its own custom “steppers”, which are called in its custom optimizer step() method (see here in the docs).

The advantage of the latter (the fastai approach) is that the parameter groups can then be used solely for differential learning rates, whereas the former makes this difficult (e.g., you would have to create two parameter groups for every one real parameter group you want: one that uses weight decay for the params that need it, and one for things like bias that shouldn’t have weight decay).
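
To make the doubling of groups concrete, here is a minimal sketch of the parameter-group approach combined with differential learning rates. It is plain Python working on (name, parameter) pairs. `build_param_groups`, the layer-group layout, and the parameter names are assumptions for illustration, modeled on the common pattern of matching "bias" and "LayerNorm.weight" in parameter names. It is not code taken from huggingface or fastai.

```python
def build_param_groups(layer_groups, lrs, weight_decay=0.01,
                       no_decay=("bias", "LayerNorm.weight")):
    """Split each layer group into a decayed and a non-decayed optimizer group.

    layer_groups: list of lists of (name, param) pairs, one list per learning rate.
    lrs:          one learning rate per layer group.
    Returns a list of dicts with "params", "lr" and "weight_decay" keys.
    """
    if len(layer_groups) != len(lrs):
        raise ValueError("need exactly one learning rate per layer group")
    groups = []
    for named_params, lr in zip(layer_groups, lrs):
        named_params = list(named_params)
        # A parameter skips decay if any no_decay substring occurs in its name.
        decay = [p for n, p in named_params if not any(nd in n for nd in no_decay)]
        skip = [p for n, p in named_params if any(nd in n for nd in no_decay)]
        groups.append({"params": decay, "lr": lr, "weight_decay": weight_decay})
        groups.append({"params": skip, "lr": lr, "weight_decay": 0.0})
    return groups


# Toy example: strings stand in for tensors so the sketch runs without PyTorch.
body = [("encoder.dense.weight", "W1"), ("encoder.dense.bias", "b1"),
        ("encoder.LayerNorm.weight", "g1"), ("encoder.LayerNorm.bias", "beta1")]
head = [("classifier.weight", "W2"), ("classifier.bias", "b2")]

groups = build_param_groups([body, head], lrs=[1e-5, 1e-3])
for g in groups:
    print(g)
```

With a real model, the pairs would come from something like `model.named_parameters()`, and the resulting list could be handed to an optimizer such as `torch.optim.AdamW(groups)`. Two learning rates already require four groups here.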

Pomo · June 19, 2020, 6:07pm

Following up here after a few days…

I see from the replies that there are many ways to apply weight decay to bias (or not). But, going back to the original question, is there a reason we would want to decay the bias?

marii · June 20, 2020, 1:49am

I generally don’t, only because the bias is not multiplied by an input and simply acts as a way to shift the output. It can be legitimate for it to be a fairly large number, as it puts less pressure on the weights to model the “shift” of the activations.

With bias…
x=[1,2,3]
y=[12,13,14]
mx+b=y
m = 1
b = 11

Without bias…
x=[1,2,3]
y=[12,13,14]
mx=y
m ≈ 13/2 = 6.5 (estimated at the middle point, x=2, y=13)

b=1
x=[1,2,3]
y=[12,13,14]
mx+1=y
m ≈ (13-1)/2 = 6

b=10
x=[1,2,3]
y=[12,13,14]
mx+10=y
m ≈ (13-10)/2 = 1.5

So I have always thought of bias as a term that is mostly there to allow your weights to be smaller. Without bias, the weights might have to be fairly large, which is exactly what you are trying to avoid with weight decay.
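
The same trend can be checked with a small least-squares sketch. The figures above read m off the middle point (x=2, y=13), so a proper fit gives slightly different numbers (roughly 5.71 rather than 6.5 without bias), but the pattern is the same: the closer b gets to 11, the smaller the slope that is needed. `best_slope` is a hypothetical helper written for this illustration.

```python
def best_slope(xs, ys, b):
    """Least-squares slope m for the model y = m*x + b with b held fixed."""
    # Minimizing sum((m*x + b - y)**2) over m gives
    # m = sum(x * (y - b)) / sum(x**2).
    numerator = sum(x * (y - b) for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs)
    return numerator / denominator


xs = [1, 2, 3]
ys = [12, 13, 14]

for b in (0, 1, 10, 11):
    m = best_slope(xs, ys, b)
    # m**2 is what an L2 penalty on the weight alone would charge.
    print(f"b={b:>2}  m={m:.3f}  m^2={m * m:.3f}")
```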

This becomes more complicated with matrix multiplies, and normalization though…

DanielLam · June 20, 2020, 6:43am

Someone should double check this, but I think fastai2 applies it to all the weights and biases. I was digging through the code a while back, and saw all the model parameters (weights + biases) got updated with the decay. I saw “for p in model.parameters()”, etc.

Conventionally, weight decay (L2 regularization) is applied only to the weights. You can add it to the bias term as well, but applying it to just the weights seems good enough for regularization.

Source: Andrew Ng regularization lecture
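
Written out, the convention described above looks roughly like this. This is a sketch in generic notation: the symbols $L$ (unregularized loss), $\lambda$ (decay strength), $\eta$ (learning rate), $w$ and $b$ are chosen here and not taken from the lecture. Only the weights appear in the penalty, so only their update carries the extra shrinkage term.

```latex
J(w, b) = L(w, b) + \frac{\lambda}{2}\,\lVert w \rVert_2^2
\qquad
w \leftarrow w - \eta\left(\frac{\partial L}{\partial w} + \lambda w\right)
\qquad
b \leftarrow b - \eta\,\frac{\partial L}{\partial b}
```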