Weight decay in lesson 3

In the lesson 3 notebook there is a value, `wd`, that is passed into the learner. A quick look at the docs shows that this is for weight decay. There was no mention of it in the lecture. Why was it added, and what benefit does it provide to training?

Weight decay helps reduce overfitting. It will be discussed in more detail in part 2.

It is a form of regularization. The goal of regularization is to learn a simpler version of the model ("suppose there exist two explanations for an occurrence; the simpler one is usually better"), giving us a model more capable of generalizing to unseen data and reducing the possibility of over-fitting.

With weight decay, at each gradient update we also add to the loss the sum of the norms (or squares) of the weights multiplied by a constant: `lambda * |weights|`, where lambda is the parameter you pass as `wd`. This forces the values of the weights to stay small (how strongly depends on the value of lambda).
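A minimal PyTorch sketch of the idea (the tiny model and data here are just illustrative, not from the notebook):

```python
import torch

# Hypothetical tiny model and batch, just to show the penalty term.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

wd = 1e-2  # lambda: the weight-decay coefficient (what fastai's `wd` controls)

pred = model(x)
mse = torch.nn.functional.mse_loss(pred, y)

# L2 penalty: sum of squared weights, scaled by lambda.
l2_penalty = wd * sum((p ** 2).sum() for p in model.parameters())

# The penalty is added to the task loss, so large weights cost more.
loss = mse + l2_penalty
loss.backward()
```

In practice you don't write this by hand: PyTorch optimizers accept a `weight_decay` argument (e.g. `torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=wd)`), which applies the equivalent shrinkage directly in the update step.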


Lesson 5 (video from 1:12:15) explains this in more detail.
But usually it is not important to understand every concept right away or to jump ahead in the course; it's better to follow along chronologically if you're viewing it for the first time.
If it is important, Jeremy will explain it eventually! :wink:


Thanks for the timestamp… I understand, but I wanted to apply the segmentation code to a different dataset and wanted to know whether it is applicable. I applied `wd` and noticed the learning-rate finder plot was "nicer". I will see if I can replicate those results and post them.