Hi, I’m quite lost with weight decay. I would like to know how WD and LR work together, and how weight decay affects the loss. Isn’t the loss, simply put, just the difference between our actual and predicted data? How does reducing our weights affect the loss? Sometimes we would need a weight to increase for the loss to improve, right?

The goal of weight decay (or most regularization) is not to reduce the training loss directly, but to avoid overfitting and so produce a model that generalizes better to unseen data.

Are you asking about learning rate or L2 regularization for “LR”? As mentioned in this forum post, weight decay is the same as L2 regularization. I won’t rehash @radek’s lovely explanation from that post though.

Hi, by LR I meant learning rate. With weight decay we multiply the weights by a constant so as to prevent the weights from getting too large, yes? But I don’t get these parts:

loss_with_wd = loss + wd * (weights**2).sum() (this is our new loss function)

Then we plot the graph of loss and its corresponding weights:
And we find the gradient at a point on that curve. Isn’t the gradient supposed to be the gradient of the loss and not the gradient of the weights? For y = 2x, when we take the gradient of this curve we differentiate y with respect to x and get dy/dx, so shouldn’t it be loss.grad += wd * 2 * parameters and not weights.grad += wd * 2 * parameters?

Also, just to tie things up: the learning rate is multiplied by the value above, and that is our step, right?

Great question, Yijin! In principle you are correct. But PyTorch uses a slick notation trick:

weight.grad implicitly means the derivative of the loss function with respect to weight. It is filled in when you call loss.backward().
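To make that concrete, here is a minimal sketch, assuming PyTorch is installed. The toy data, the mean-squared-error loss, and wd = 0.1 are made up for illustration and are not from the course notebook. It compares two routes: putting the penalty inside the loss and letting autograd differentiate it, versus differentiating the plain loss and then adding wd * 2 * weights to weights.grad by hand.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data and a single weight vector (illustration only).
x = torch.randn(8, 3)
y = torch.randn(8)
weights = torch.randn(3, requires_grad=True)
wd = 0.1  # weight decay coefficient, chosen arbitrarily

# Route 1: put the penalty in the loss and let autograd differentiate it.
loss_with_wd = ((x @ weights - y) ** 2).mean() + wd * (weights ** 2).sum()
loss_with_wd.backward()
grad_from_autograd = weights.grad.clone()

# Route 2: differentiate the plain loss, then add the penalty's gradient by hand.
weights.grad.zero_()
loss = ((x @ weights - y) ** 2).mean()
loss.backward()
with torch.no_grad():
    weights.grad += wd * 2 * weights
grad_by_hand = weights.grad.clone()

# Both routes should agree up to floating-point error.
same = torch.allclose(grad_from_autograd, grad_by_hand, atol=1e-6)
print(same)

# weights.grad has one entry per weight: d(loss)/d(weight_i).
print(weights.grad.shape == weights.shape)

# loss is computed from other tensors, so it is not a leaf tensor;
# by default autograd only populates .grad on leaf tensors such as weights.
print(loss.is_leaf, weights.is_leaf)
```

So weights.grad is not "the gradient of the weights". It is the slot on the weights tensor where the gradient of the loss with respect to those weights is stored, which is why the penalty's derivative gets added there.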

However, I don’t quite get why we can’t just use loss.grad. By using weight.grad, aren’t we differentiating the weight with respect to the loss function? Or does weight.grad just return the value of the derivative of the loss function with respect to weight?

I have to admit that I have never really looked at the actual PyTorch code syntax / nomenclature / notation… =P

My understanding is just that:

loss_with_weight_decay = loss + (wdhyperparam * params^2)

So when differentiating both sides w.r.t. params, it’s equivalent to:

loss.grad += 2 * wdhyperparam * params

And since wdhyperparam is just a hyperparameter that you choose, the factor of 2 can be absorbed into it to give a single hyperparameter wd. So effectively you just add wd * params (elementwise, one term per parameter, not a sum) to your original gradient, which then gets multiplied by the learning rate for your next param update.
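Here is a small plain-Python sketch of that update, assuming vanilla SGD with no momentum. The function name sgd_step_with_weight_decay and all the numbers are hypothetical, chosen only to show the arithmetic.

```python
def sgd_step_with_weight_decay(params, grads, lr, wd):
    """One plain SGD step where the decay term wd * p is added to each gradient."""
    return [p - lr * (g + wd * p) for p, g in zip(params, grads)]


lr, wd = 0.1, 0.5
params = [1.0, -2.0, 0.5]

# Case 1: pretend the data loss is flat (zero gradient) to isolate the decay.
# Each weight is then just multiplied by the constant (1 - lr * wd).
decayed = sgd_step_with_weight_decay(params, [0.0, 0.0, 0.0], lr, wd)
print(decayed)

# Case 2: a strong enough data gradient still lets a weight grow.
# For p = 1.0 and g = -2.0 the combined gradient is -2.0 + 0.5 * 1.0 = -1.5,
# so the step moves the weight up despite the decay.
grown = sgd_step_with_weight_decay([1.0], [-2.0], lr, wd)
print(grown)
```

The first case is the "multiply the weights by a constant" view of weight decay. The second shows that a weight can still increase when the data loss pushes it that way; the decay only adds a pull toward zero on top.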

Someone else will have to help and give a more detailed explanation about weight.grad vs. loss.grad in PyTorch… Thanks.