galopy
October 24, 2023, 3:41am
1
Hi all.

I am not sure how weight decay simplifies into:

```
weight.grad += wd*weight
```

from

```
L = L + sum(weight**2) * wd
```

When I look at the Deep Learning book by Goodfellow et al. or Papers with Code, weight decay is defined as:

```
L = L + wd * weight.T@weight
```

Is `weight.T@weight` different from `sum(weight**2)`? When I assume they are the same, I get:

```
L = L + sum(weight**2) * wd
```

And if I take the derivative, I have:

```
weight.grad = weight.grad + sum(weight)*wd
```

Am I taking the derivatives wrong? Or is there a way to turn `sum(weight)*wd` into `weight*wd`?

Thank you for your help.

Hi,

You’re right, `weight.T@weight` is the same as `sum(weight**2)`.
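For a 1-D weight vector, `weight.T@weight` is just the dot product of the weights with themselves, i.e. the sum of their squares. A quick numerical check (the tensor values here are made up purely for illustration):

```
import torch

w = torch.tensor([0.5, -1.2, 2.0])   # hypothetical 1-D weight vector
print(w @ w)                          # dot product: 0.25 + 1.44 + 4.00 = 5.69
print((w ** 2).sum())                 # sum of squares: same value, 5.69
```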

Let’s take a simple, tiny example with only two weight parameters, `w = [w_1, w_2]`. Then the loss with weight decay (also known as L2 regularization) is:

`L_new(w) = L_original(w) + (w_1^2 + w_2^2)*wd`

Taking the partial derivative of `L_new` with respect to `w_1` gives us

`∂L_new/∂w_1 = ∂L_original/∂w_1 + 2*wd*w_1`

We can absorb the factor of 2 into the constant `wd`. Hence,

`∂L_new/∂w_1 = ∂L_original/∂w_1 + wd*w_1`

Rewriting in vector notation, with `w.grad = [∂L_original/∂w_1, ∂L_original/∂w_2]`, the update

`w.grad += wd*w`

adjusts the gradient as desired.
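You can also confirm this with autograd: adding `sum(weight**2) * wd` to the loss adds `2*wd*weight` to the gradient element-wise, not `sum(weight)*wd`. A minimal sketch, with made-up values and only the weight-decay term used as the loss so its gradient is isolated:

```
import torch

wd = 0.1
w = torch.tensor([0.5, -1.2, 2.0], requires_grad=True)  # hypothetical weights

penalty = (w ** 2).sum() * wd   # the weight-decay term on its own
penalty.backward()

print(w.grad)               # tensor([0.1000, -0.2400, 0.4000])
print(2 * wd * w.detach())  # same values: each weight gets its own 2*wd*w_i term
```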


galopy
November 3, 2023, 3:20am
3
Thank you very much. I understand how they are the same thing now.