galopy
October 24, 2023, 3:41am
Hi all.
I am not sure how weight decay simplifies into:
weight.grad += wd*weight
from
L = L + sum(weight**2) * wd
When I look at the Deep Learning book by Goodfellow et al. or Papers with Code, weight decay is defined as:
L = L + wd * weight.T@weight
Is weight.T@weight different from sum(weight**2)?
Because when I assume they are the same, I get:
L = L + sum(weight**2) * wd
And if I take the derivative, I have:
weight.grad = weight.grad + sum(weight)*wd
Am I taking the derivative wrong? Or is there a way to turn sum(weight)*wd into weight*wd?
Thank you for your help.
Hi,
You’re right, weight.T@weight is the same as sum(weight**2).
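If it helps, here is a quick sanity check in PyTorch (a minimal sketch; the values in w are made up for illustration):

```python
import torch

# Hypothetical weight vector, just for illustration.
w = torch.tensor([0.5, -1.2, 3.0])

# For a 1-D weight vector, weight.T @ weight is the dot product of the
# weights with themselves, i.e. the sum of the squared entries.
print(torch.allclose(w @ w, (w ** 2).sum()))  # True
```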
Let’s take a simple, tiny example with only two weight parameters, w = [w_1, w_2]. Then the loss with weight decay (also known as L2 regularization) is:
L_new(w) = L_original(w) + (w_1^2 + w_2^2)*wd
Taking the partial derivative of L_new with respect to w_1 gives us:
\partial L_new / \partial w_1 = \partial L_original / \partial w_1 + 2*wd*w_1
We can absorb the factor of 2 into the constant wd. Hence,
\partial L_new / \partial w_1 = \partial L_original / \partial w_1 + wd*w_1
Rewriting in vector notation, with w.grad = [\partial L_original / \partial w_1, \partial L_original / \partial w_2], the update
w.grad += wd*w
changes the gradient as desired, which is exactly the weight.grad += wd*weight form from your question.
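You can also check the derivation with autograd (again a minimal sketch; wd and the weight values are arbitrary): the gradient of wd * sum(w**2) comes out as 2*wd*w, which matches the w.grad += wd*w shortcut once the 2 is folded into wd.

```python
import torch

wd = 0.1  # arbitrary weight decay coefficient for the example
w = torch.tensor([0.5, -1.2, 3.0], requires_grad=True)

# Backprop through the explicit penalty term wd * sum(w**2).
penalty = wd * (w ** 2).sum()
penalty.backward()

# Autograd gives d(penalty)/dw = 2 * wd * w ...
print(torch.allclose(w.grad, 2 * wd * w.detach()))  # True

# ... so adding wd*w directly to the gradient is the same update
# with the factor of 2 absorbed into wd.
```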
galopy
November 3, 2023, 3:20am
Thank you very much. I understand how they are the same thing now.