Taking the derivative of weight decay

Hi all.

I am not sure how weight decay simplifies into:

weight.grad += wd*weight

from

L = L + sum(weight**2) * wd

When I look at the Deep Learning book by Goodfellow et al. or Papers with Code, weight decay is defined as:

L = L + wd * weight.T@weight

Is weight.T@weight different from sum(weight**2)?
Because when I assume they are the same, I get:

L = L + sum(weight**2) * wd

And if I take the derivative, I have:

weight.grad = weight.grad + sum(weight)*wd

Am I taking the derivatives wrong? Or is there a way to turn sum(weight)*wd into weight*wd?

Thank you for your help.

Hi,

You’re right: weight.T@weight is the same as sum(weight**2).
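
As a quick numerical check (a minimal sketch, assuming weight is a 1-D tensor, so weight.T@weight is just the dot product):

import torch

# For a 1-D weight vector, the dot product weight @ weight is the sum of squares.
weight = torch.randn(4)
print(torch.allclose(weight @ weight, (weight ** 2).sum()))  # True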

Let’s take a tiny example with only two weight parameters, w = [w_1, w_2]. Then the loss with weight decay (also known as L2 regularization) is:

L_new(w) = L_original(w) + (w_1^2 + w_2^2)*wd

Taking the partial derivative of L_new with respect to w_1 gives us

∂L_new/∂w_1 = ∂L_original/∂w_1 + 2*wd*w_1
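
If you want to double-check this step symbolically, here is a small sketch with SymPy (the symbol names are just for illustration):

import sympy as sp

# Differentiate the penalty term wd*(w1**2 + w2**2) with respect to w1.
w1, w2, wd = sp.symbols('w1 w2 wd')
penalty = wd * (w1 ** 2 + w2 ** 2)
print(sp.diff(penalty, w1))  # 2*w1*wd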

We can absorb the factor of 2 into the constant wd. Hence,

∂L_new/∂w_1 = ∂L_original/∂w_1 + wd*w_1

Rewriting in vector notation, with w.grad = [∂L_original/∂w_1, ∂L_original/∂w_2], the update

w.grad += wd*w

updates the gradient as desired.
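
You can also verify the whole thing with autograd. This is a minimal sketch with a placeholder original loss (3*sum(w), chosen only so its gradient is easy to write down by hand):

import torch

w = torch.randn(3, requires_grad=True)
wd = 0.01

# Placeholder original loss; any differentiable function of w would do.
original_loss = (3.0 * w).sum()
loss = original_loss + wd * (w ** 2).sum()
loss.backward()

# Manual gradient: d(original)/dw + 2*wd*w (the 2 is what gets absorbed into wd).
manual_grad = torch.full_like(w, 3.0) + 2 * wd * w.detach()
print(torch.allclose(w.grad, manual_grad))  # True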

Thank you very much. I understand how they are the same thing now.