Lesson 5 In-Class Discussion ✅

Weight decay is different from the learning rate or the number of epochs.
It’s a regularization coefficient (part of a regularization term) that we add to the loss as a penalty on the model. This penalty’s job is to control the complexity of the model (the size of the coefficients, or in some cases the number of coefficients) to encourage the model not to become over-complex and overfit (memorize) the training data.

Or no need for an AWS p3.

Any more likes for this…
This can be a very useful thing where you have a lot of noisy background that isn’t of interest and only a very small area of pixels that is. Just throwing out my thought; not sure whether it can be done or not, so I’m seeking clarification.

Arnold needed to come to life. He expects nothing less than a p3.

How is the wd parameter related to the L1/L2 regularization norm? From the formula that was shown, it looks like a coefficient multiplied by the L2 norm, right?

my question 🙂

Almost, but not exactly the same thing, as explained in this article.

I have read the DLCV book; it is mentioned there. Screenshot attached.

I have a weight decay question…

Does the amount of weight decay increase proportionally to the size of the network? i.e. will a network with more parameters have more weight decay, and do you need to think about that when you specify it?

PCA can be applied to image data. There are some examples in the computational linear algebra course, for instance, using PCA to identify the foreground vs. background in a video.
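As a rough sketch of the video example (not the course’s actual code; `frames` below is just placeholder data), a single principal component can approximate the static background, and the residual highlights the moving foreground:

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch: separate a static background from moving foreground with PCA.
# Assumes `frames` is a (n_frames, height, width) array of grayscale frames.
frames = np.random.rand(100, 48, 64)          # placeholder data
X = frames.reshape(len(frames), -1)           # one flattened frame per row

pca = PCA(n_components=1)                     # dominant component ≈ static background
background = pca.inverse_transform(pca.fit_transform(X)).reshape(frames.shape)
foreground = np.abs(frames - background)      # residual highlights moving objects
```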

No, it doesn’t. Weight decay applies to each individual weight when it is updated, so it is agnostic to the number of parameters.

That’s not the same thing; it’s a parameter that reduces the learning rate over time.

A larger batch will yield more accurate (less noisy) weight updates. The tradeoff is you get fewer weight updates per epoch. On the other hand, a small batch will yield noisier updates, but you get more updates per epoch.
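A quick back-of-the-envelope illustration with made-up numbers:

```python
# Same dataset, two batch sizes: the batch size sets how many updates you get per epoch.
n_samples = 50_000
for batch_size in (64, 1024):
    updates_per_epoch = n_samples // batch_size
    print(f"batch_size={batch_size}: {updates_per_epoch} weight updates per epoch")
# batch_size=64:   781 updates per epoch (noisier, but many of them)
# batch_size=1024: 48 updates per epoch (more accurate, but far fewer)
```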

wd plays the same role in both L1 and L2 regularization. The difference is in the sum over the weights that gets multiplied by wd: L1 sums the absolute values of the weights, while L2 sums the squared values of the weights. See the formulas below.
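In symbols (a sketch, with $L_{\text{data}}$ standing for the unregularized loss and $w_i$ for the weights):

$$L_{\text{L1}} = L_{\text{data}} + wd \sum_i |w_i| \qquad\qquad L_{\text{L2}} = L_{\text{data}} + wd \sum_i w_i^2$$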

That’s learning rate decay, which is different from weight decay. Learning rate decay reduces the learning rate over several epochs. Weight decay affects the update of the parameters during backpropagation.
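A small Python sketch of the difference (plain SGD update, all values illustrative):

```python
lr, wd, lr_decay = 0.1, 0.01, 0.95

# Weight decay: every update also shrinks the weight itself a little.
#   w <- w - lr * (grad + wd * w)
def sgd_step_with_weight_decay(w, grad):
    return w - lr * (grad + wd * w)

# Learning rate decay: the update rule for w is unchanged, but the
# learning rate itself is reduced as training progresses (here per epoch).
def decayed_lr(epoch):
    return lr * (lr_decay ** epoch)
```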

Okay, thanks for clearing up the difference.

Does something like “differential weight decay”, like we do for learning rates, make any sense?

Definitely. Leslie Smith is working on it; please check the link in my previous post.
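For what it’s worth, a per-layer (“differential”) weight decay can already be sketched with PyTorch optimizer parameter groups, mirroring how discriminative learning rates are set (the layers and values below are made up):

```python
import torch
from torch import nn, optim

# Sketch: per-group weight decay, analogous to discriminative learning rates.
body = nn.Linear(10, 10)
head = nn.Linear(10, 2)

optimizer = optim.SGD([
    {"params": body.parameters(), "weight_decay": 1e-4},  # lighter penalty on early layers
    {"params": head.parameters(), "weight_decay": 1e-2},  # stronger penalty on the head
], lr=0.1)
```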

Nice! This is the type of explanation I was looking for!

“Fewer weight updates per epoch” thus needing more epochs, right? As @sgugger said?

Thanks for clearing up the difference between weight decay and learning rate decay.