# Weight decay vs LR

Hi, I’m quite lost with weight decay, and I would like to know how WD and LR work together, and how it is that weight decay affects loss. Isn’t loss, simply put, just the difference between our actual and predicted data? How does reducing our weights affect loss? Sometimes we would need a weight to increase for the loss to improve, right?

Also, how does WD compare to LR?

The goal of weight decay (or most regularization) is not directly about reducing loss*, but more specifically about attempting to avoid overfitting, and therefore trying to develop a model that generalizes better to unseen data.

By “LR”, are you asking about learning rate or L2 regularization? As mentioned in this forum post, weight decay is the same as L2 regularization. I won’t rehash @radek’s lovely explanation from that post here, though.

*although it’s a by-product of a well-trained model

Hi, by LR I meant learning rate. With weight decay we multiply the weights by a constant so as to prevent the weights from getting too large, yes? But I don’t get these parts:

`loss_with_wd = loss + wd * (weights**2).sum()` (this is our new loss function)

Then we plot the graph of loss against its corresponding weights, and we find the gradient of a point on that curve. Isn’t the gradient supposed to be the gradient of the loss, and not the gradient of the weights? For y = 2x, when we take the gradient of this curve we differentiate y with respect to x and get dy/dx, so shouldn’t it be
`loss.grad += wd * 2 * parameters` and not `weights.grad += wd * 2 * parameters`?

Also, just to tie things up: the learning rate is multiplied by the value above, and that is our step, right?
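For anyone following along, the modified loss from the question can be checked with a quick numeric sketch (toy values assumed for illustration, plain Python in place of tensors):

```python
# Toy sketch of: loss_with_wd = loss + wd * (weights**2).sum()
wd = 0.1                            # weight-decay hyperparameter (assumed)
weights = [1.0, -2.0]               # made-up weights
base_loss = 0.5                     # pretend this came from the model

loss_with_wd = base_loss + wd * sum(w ** 2 for w in weights)
print(loss_with_wd)  # 0.5 + 0.1 * (1 + 4) = 1.0
```

Large weights inflate the penalty term quadratically, which is why minimizing this combined loss discourages them.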

I’ve asked the same before : )

Yijin


Cheers man, I’ll copy the answer here:

Great question, Yijin! In principle you are correct. But `PyTorch` uses a slick notation trick:

`weight.grad` implicitly holds the derivative of the loss function with respect to `weight`.
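A minimal runnable sketch of that point (toy values assumed): after `loss.backward()`, PyTorch stores d(loss)/d(weight) in `weight.grad`, so the weight-decay term in the loss shows up as an extra `2 * wd * weight` in the gradient.

```python
import torch

weight = torch.tensor([3.0], requires_grad=True)
wd = 0.1

# A stand-in loss: 2*w, plus the weight-decay penalty wd * w**2
loss = (2 * weight).sum() + wd * (weight ** 2).sum()
loss.backward()

# d(loss)/d(weight) = 2 + 2 * wd * weight = 2 + 2 * 0.1 * 3 = 2.6
print(weight.grad)  # tensor([2.6000])
```

Note there is no `loss.grad` to read here: `backward()` is called *on* the loss, and the resulting derivatives land on the leaf tensors (the weights).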

However, I don’t quite get why we can’t just use `loss.grad`. By doing `weight.grad`, aren’t we differentiating the weight with respect to the loss function? Or does `weight.grad` return the value of the derivative of the loss function with respect to `weight` only?

Once again, thanks man.

I have to admit that I have never really looked at the actual PyTorch code syntax / nomenclature / notation… =P

My understanding is just that:

```
loss_with_weight_decay = loss + wdhyperparam * (params**2)
```

So when differentiating both sides w.r.t. params, it’s equivalent to:

```
loss.grad += 2 * wdhyperparam * params
```

And since `wdhyperparam` is just a hyperparameter that you choose, it can absorb the factor of 2 to give a single hyperparameter `wd`. So effectively you just add `wd * params` to your original gradient, which then gets multiplied by the learning rate for your next parameter update.
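The full update step described above can be sketched with toy numbers (plain Python standing in for tensors; `lr` and `wd` are hyperparameters you choose):

```python
lr, wd = 0.01, 0.1
param = 3.0

# Pretend the base loss is param**2, so its gradient is 2 * param
grad = 2 * param            # 6.0
grad += wd * 2 * param      # add the weight-decay term: 6.0 + 0.6 = 6.6

# One SGD step: learning rate times the total gradient
param -= lr * grad          # 3.0 - 0.01 * 6.6 = 2.934
print(round(param, 3))  # 2.934
```

The weight-decay term always points away from the current weight value, so each step nudges the weight toward zero in addition to following the base-loss gradient.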

Someone else will have to help and give a more detailed explanation of `weight.grad` vs. `loss.grad` in PyTorch… Thanks.

Yijin


Cheers man, for some reason I didn’t get a notification on this thread. Thank you for your work.