Is weight decay applied to the bias term?

I generally don’t, mainly because the bias is not multiplied by an input; it simply acts as a way to shift the output. It can legitimately be a fairly large number, and leaving it unregularized puts less pressure on the weights to model the “shift” of the activations.
Exact fit with bias:
x = [1, 2, 3]
y = [12, 13, 14]
mx + b = y
m = 1, b = 11

Without bias (mx = y), no single m fits all three points; fitting the middle point (x=2, y=13) gives:
m = 13/2 = 6.5

With b fixed at 1 (mx + 1 = y), again fitting the middle point:
m = (13 - 1)/2 = 6

With b fixed at 10 (mx + 10 = y):
m = (13 - 10)/2 = 1.5

So I have always thought of the bias as a term that is mostly there to allow your weights to be smaller: without it, the weights might have to be fairly large, which is exactly what weight decay is trying to avoid.
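As a quick sanity check, here is the same comparison done numerically. This is a minimal sketch of my own (numpy and the least-squares fit are my additions; the arithmetic above just matches the middle point), so the exact values differ slightly, but the trend is the same: the bigger the bias, the smaller the weight needs to be.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([12.0, 13.0, 14.0])

# With a bias: design matrix [x, 1] gives the exact fit m=1, b=11.
A = np.stack([x, np.ones_like(x)], axis=1)
m, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"with bias:    m={m:.2f}, b={b:.2f}")  # m=1.00, b=11.00

# Without a bias: least-squares fit of mx = y.
m_nb = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]
print(f"without bias: m={m_nb:.2f}")  # m≈5.71, much larger than 1

# With the bias fixed, only a small slope is needed.
for b_fixed in (1.0, 10.0):
    m_fix = np.linalg.lstsq(x[:, None], y - b_fixed, rcond=None)[0][0]
    print(f"b fixed at {b_fixed:>4}: m={m_fix:.2f}")  # m≈5.29, then m≈1.43
```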

This becomes more complicated with matrix multiplies and normalization layers, though…
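For what it’s worth, the usual way to act on this in PyTorch is to put the biases (and, by the same logic, the normalization layers’ scale/shift parameters) into a parameter group with weight_decay=0. A minimal sketch, assuming a plain nn.Module and AdamW; the ndim-based rule below is just one common convention, not the only way to split the parameters:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.LayerNorm(32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

decay, no_decay = [], []
for name, p in model.named_parameters():
    if not p.requires_grad:
        continue
    # Biases and norm-layer scale/shift are 1-D; matmul weights are 2-D+.
    # Excluding all 1-D params from decay is a common heuristic, not a rule.
    (no_decay if p.ndim <= 1 else decay).append(p)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```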
