Why apply weight decay?

I understand regularization - we penalize weights proportionally to their magnitude to prevent overfitting.

But why weight decay? If I understand it correctly, weight decay means multiplying the weights at the end of each batch update by a number close to but less than 1, say 1 - 1e-6.

What is it that we are hoping to achieve by applying weight decay? I tried googling for this, but mostly people tend to talk about L2 regularization.


Weight decay is an additional term in the weight update that causes the weights to decay exponentially toward zero whenever nothing else is pushing them away from it.

Weight decay and L2 regularization are mathematically identical (at least for plain SGD)! :slight_smile: In the last couple of ML lectures I’ve gone into this in more detail - check them out if you’re interested.


Thank you very much Jeremy! I think I can see how weight decay is equivalent to L2 regularization. I will definitely watch the lectures even if it might take me a little longer to get around to doing so :slight_smile: The information you share is pure gold, but it is not easy for a person with a full-time job to keep up with, especially if one wants to immediately apply what they learn. Trying to convince myself to start looking at it more like a marathon than a sprint :wink:

@ecdrid - thank you for the links, appreciate you sharing them. Both of the linked texts speak to regularization, with the 2nd one covering both L1 and L2. What I was referring to was something related though slightly different - not adding a penalty term to the cost that we backpropagate (as in regularization), but multiplying the weights at the end of each batch by some number close to 1.
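
For anyone who wants to check the equivalence numerically, here is a minimal sketch. It assumes plain SGD with no momentum and an L2 penalty of (wd / 2) * ||w||^2; the names `sgd_step_l2`, `sgd_step_decay`, `lr` and `wd` are made up for this illustration and do not come from any library.

```python
import numpy as np

def sgd_step_l2(w, grad, lr, wd):
    # L2 regularization: the penalty (wd / 2) * ||w||^2 adds wd * w to the gradient
    return w - lr * (grad + wd * w)

def sgd_step_decay(w, grad, lr, wd):
    # Weight decay: shrink the weights by (1 - lr * wd), then take the plain gradient step
    return (1.0 - lr * wd) * w - lr * grad

rng = np.random.default_rng(0)
w = rng.normal(size=5)
grad = rng.normal(size=5)  # stand-in for the gradient of the unregularized loss

a = sgd_step_l2(w, grad, lr=0.1, wd=1e-2)
b = sgd_step_decay(w, grad, lr=0.1, wd=1e-2)
print(np.allclose(a, b))  # True: both rules give the same new weights
```

Note that the multiplier is 1 - lr * wd, so it depends on the learning rate as well. With `grad` set to zero, repeating the step just multiplies `w` by that factor each time, which is the exponential decay mentioned above. With momentum or adaptive optimizers such as Adam, the two formulations are generally no longer the same update.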


Oh I know - I didn’t mean to suggest that you should have watched them already, just wanted you to know they’re there if you want to dig in further! Yes, learning DL is a marathon, or perhaps an ultra-marathon.


This might be a bit off topic, but I am still using this thread… (Sorry for that, I didn’t want to create another thread.)

Why do we care so much about the initial weight initialisation, given that the architecture will learn the weights anyhow during training?

  • As I understand it (Karpathy mentioned this in one of his lecture series), there is an issue with gradients flowing back through the network when we initialise the weights to a constant such as 0. I understand that we might end up multiplying by zero, so the net will not learn anything or will take forever. (Can someone share why it is so important that we have so many different methods for weight initialisation, just to improve on the SoTA?)

  • Also, I had a look at my trained weights (a neural net on MNIST) and found them to be pretty small
    (the largest were around 0.5). So isn’t it better to initialise with random values or constants?

If you initialize weights to zero… there will be no learning at all! :slight_smile: You need to break the symmetry - otherwise every unit in a layer will receive the same gradient update.
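
A rough way to see the symmetry problem, assuming a toy two-layer network (tanh hidden layer, linear output, mean squared error) on random data. The helper `hidden_grad` and all the shapes are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # 8 toy samples, 3 features
y = rng.normal(size=(8, 1))  # toy regression targets

def hidden_grad(W1, W2):
    # Forward pass: tanh hidden layer, linear output
    h = np.tanh(X @ W1)
    err = h @ W2 - y
    # Backward pass: gradient of the loss w.r.t. W1 (one column per hidden unit)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    return X.T @ dh / len(X)

g_zero = hidden_grad(np.zeros((3, 4)), np.zeros((4, 1)))
g_const = hidden_grad(np.full((3, 4), 0.5), np.full((4, 1), 0.5))
g_rand = hidden_grad(rng.normal(size=(3, 4)), rng.normal(size=(4, 1)))

print(np.all(g_zero == 0))                   # True: no gradient reaches the first layer
print(np.allclose(g_const, g_const[:, :1]))  # True: every hidden unit gets the same gradient
print(np.allclose(g_rand, g_rand[:, :1]))    # False: random init breaks the symmetry
```

With a constant initialization the hidden units start identical and receive identical updates, so they stay identical and the layer behaves like a single unit.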

Initialization is important because it can make learning easier, harder, or in extreme cases impossible. The issues are compounded by the depth of the architecture. With many layers and an initialization that is off, the training in the lower layers might not happen at all. Also, with ReLUs you might have a problem of them becoming completely inactive (dead ReLUs) and effectively useless. With sigmoid activations, neurons can get saturated, meaning they will output values very close to either 0 or 1, and if you look at the shape of the sigmoid as it asymptotically approaches those values… it is nearly horizontal! Hence there is nearly no gradient to speak of, and learning anything useful might take a very, very long time!
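
To put numbers on the saturation point, here is a small sketch that evaluates the sigmoid and its derivative at a few pre-activation values. The function names are only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    # The output approaches 1 while the gradient shrinks toward 0
    print(z, sigmoid(z), sigmoid_grad(z))
```

The gradient is 0.25 at z = 0 and drops below 1e-4 by z = 10, so a unit whose pre-activations start out large gets almost no learning signal.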

And we have not even started talking about recurrent architectures, where we propagate through the same set of weights many, many times, and where it is very easy to run into the vanishing / exploding gradient problem.
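
A simplified sketch of why repeated propagation through the same weights is delicate. It ignores the nonlinearity and assumes the recurrent matrix is a scaled random orthogonal matrix, so the only thing that varies is the scale. `norm_after` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 16, 50
# Random orthogonal matrix via QR: all singular values are 1 before scaling
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

def norm_after(scale):
    # Push a gradient-like vector back through the same weights `steps` times
    W = scale * Q
    g = np.ones(n)
    for _ in range(steps):
        g = W.T @ g
    return np.linalg.norm(g)

for scale in (0.9, 1.0, 1.1):
    print(scale, norm_after(scale))
```

The norm is multiplied by `scale` at every step, so after 50 steps it has shrunk by 0.9**50 or grown by 1.1**50, while a scale of exactly 1 leaves it unchanged.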

All this is very important and quite fascinating :slight_smile: From a practical perspective though, this is very low level and most of the good frameworks will do this for us - fastai certainly does so! :slight_smile:

If for some reason you want to experiment with this (for instance, deeply hidden masochistic tendencies :slight_smile: ), it is super awesome that Jeremy shows us how to construct networks in PyTorch directly or even numpy! I wish I had come across this information a year ago - it would have made my life so much more enjoyable :slight_smile: Either way, the point that I am trying to make is this - you could experiment with not initializing the weights to anything reasonable (or setting them all to zeros) - you will be surprised how quickly a network becomes untrainable :slight_smile: Those neural nets are fickle beasts, and the only reason they seem like child’s play to us is because of the giants who walked before us and carried out the research, and the giants who built all the tools to abstract the low-level considerations away :wink:


All initialization techniques, as far as I know, rely on randomness.

Initializing RNN weight matrices with the identity matrix is also popular!
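
As a small illustration of what identity initialization buys you, here is a sketch of a ReLU recurrent step with no input and zero bias. That setup, and the names `W_hh` and `h`, are assumptions made for this example.

```python
import numpy as np

hidden_size = 4
W_hh = np.eye(hidden_size)              # recurrent weights initialized to the identity
h0 = np.array([0.5, 0.0, 1.5, 2.0])     # some non-negative hidden state
h = h0.copy()

for _ in range(100):
    # ReLU RNN step with no input and zero bias
    h = np.maximum(0.0, W_hh @ h)

print(np.array_equal(h, h0))  # True: the state is carried through unchanged
```

At initialization the hidden state is simply copied from step to step, so it neither vanishes nor explodes before training starts. In PyTorch, `torch.nn.init.eye_` fills a 2-D tensor with the identity and could be used for this.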