Why apply weight decay?

If you initialize weights to zero… there will be no learning at all! :slight_smile: You need to break the symmetry - otherwise every unit in a layer computes the same thing and receives the same gradient update, so the units can never become different from one another.
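If you want to see this for yourself, here is a rough PyTorch sketch (my own toy example, not anything from the lesson notebooks) - set every parameter to the same constant and look at the gradients after a single backward pass:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny MLP: 4 -> 3 -> 2, with every parameter set to the same constant
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
with torch.no_grad():
    for p in model.parameters():
        p.fill_(0.0)          # also try 0.5 instead of 0.0

x = torch.randn(8, 4)
y = torch.randn(8, 2)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

for name, p in model.named_parameters():
    print(name, p.grad)
# With zeros: the first-layer gradients are exactly zero, so those weights never move.
# With any other constant: every hidden unit gets the same gradient,
# so the units stay identical copies of each other after every update.
```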

Initialization is important because it can make learning easier, harder, or in extreme cases impossible. The issues are compounded by the depth of the architecture - with many layers and an initialization that is off, training in the lower layers might not happen at all. Also, with ReLUs you might have the problem of units becoming completely inactive (dead ReLUs) and effectively useless. With sigmoid activations, neurons can get saturated, meaning they will output values very close to either 0 or 1, and if you look at the shape of the sigmoid as it asymptotically approaches those values… it is nearly horizontal! Hence there is nearly no gradient to speak of, and learning anything useful might take a very, very long time!
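Here is a quick way to see both problems numerically (again just a little sketch of my own): ask PyTorch for the gradient of a sigmoid and a ReLU at a few pre-activation values.

```python
import torch

# Sigmoid: the further into saturation you go, the smaller the gradient
z = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(z).sum().backward()
print(z.grad)   # roughly [0.2500, 0.1050, 0.0066, 0.0000] -- the gradient collapses

# Dead ReLU: a unit whose pre-activation is negative for every input passes no gradient back
z = torch.tensor([-3.0, -1.0, -0.5], requires_grad=True)
torch.relu(z).sum().backward()
print(z.grad)   # [0., 0., 0.] -- nothing flows back, so the unit has no way to recover
```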

And we have not even started talking about recurrent architectures, where we propagate through the same set of weights many, many times and where it is very easy to run into the vanishing / exploding gradient problem.
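You can get an intuition for why reusing the same weight over and over is dangerous with plain Python arithmetic (toy numbers, purely for illustration):

```python
w = 0.9          # a single recurrent weight, reused at every timestep
h = 1.0
for t in range(100):
    h = w * h
print(h)         # 0.9**100 is about 2.7e-5 -- the signal (and its gradient) all but vanishes
                 # with w = 1.1 instead: 1.1**100 is about 1.4e4 -- it explodes
```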

All this is very important and quite fascinating :slight_smile: From a practical perspective though, this is very low level and most of the good frameworks will do this for us - fastai certainly does so! :slight_smile:

If for some reason you might want to experiment with this (for instance, deeply hidden masochistic tendencies :slight_smile: ), it is super awesome that Jeremy shows us how to construct networks in PyTorch directly or even in numpy! I wish I had come across this information a year ago - it would have made my life so much more enjoyable :slight_smile: Either way, the point that I am trying to make is this - you could experiment with not initializing the weights to anything reasonable (or setting them all to zero), as in the sketch below - you will be surprised how quickly a network becomes untrainable :slight_smile: Those neural nets are fickle beasts and the only reason they seem like child’s play to us is because of the giants that walked before us and carried out the research and the giants who built all the tools to abstract the low level considerations away :wink:
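For example (just my own toy sketch, not fastai code), take a small PyTorch net, throw away the sensible default initialization by zeroing everything, and compare how training goes:

```python
import torch
import torch.nn as nn

def make_net(sabotage=False):
    net = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 1))
    if sabotage:
        with torch.no_grad():
            for p in net.parameters():
                p.zero_()                        # discard the sensible default init
    return net

def train(net, steps=200):
    x = torch.randn(256, 10)
    y = x.sum(dim=1, keepdim=True)               # a simple target the net should fit easily
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    for _ in range(steps):
        loss = nn.functional.mse_loss(net(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

torch.manual_seed(42)
print("default init:", train(make_net()))                # loss drops steadily
print("all zeros:   ", train(make_net(sabotage=True)))   # loss barely moves -- only the last bias can learn
```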
