Weight Decay is different from Learning Rate / No of Epochs.
It’s a regularization coefficient (part of a regularization term) that we add to the loss as a penalty on the model. This penalty’s job is to control the complexity of the model (the size of the coefficients, or in some cases the number of coefficients) so the model doesn’t become over-complex and overfit (memorize) the training data.
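For anyone who finds code clearer than prose, here’s a minimal sketch of the idea (PyTorch-style, with `model`, `x`, `y` and `loss_fn` assumed to already exist, not the lesson’s actual code):

```python
wd = 1e-4                       # weight decay coefficient (hyperparameter)

preds = model(x)
data_loss = loss_fn(preds, y)

# L2 penalty: wd times the sum of squared weights, added on top of the data loss.
# Bigger weights -> bigger penalty, which nudges the model towards simpler solutions.
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = data_loss + wd * l2_penalty
loss.backward()
```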
Or no need for an AWS p3
any more likes to this…
This can be a very useful thing where you have a lot of noisy background that isn’t of interest and only a very small area of pixels that is… just throwing out my thought, not sure if it can be done or not, so seeking clarifications.
Arnold needed to come to life. He expects nothing less than a p3.
How is the wd parameter related to the L1/L2 regularization norm? From the formula that was shown, it looks like the coefficient is multiplied by the L2 norm, right?
I have a weight decay question…
Does the amount of weight decay increase proportionally to the size of the network? i.e. will a network with more parameters have more weight decay, and do you need to think about that when you specify it?
PCA can be applied to image data. There are some examples in the computational linear algebra course, for instance, using PCA to identify the foreground vs. background in a video.
No, it doesn’t. Weight decay applies to each individual weight during the update, so it is agnostic to the number of parameters.
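A tiny sketch of why (plain SGD, purely illustrative):

```python
# Each weight is updated independently:
#   w <- w - lr * (grad + wd * w)
# A network with 10x more parameters just applies the same per-weight rule to more weights;
# no individual weight gets decayed more strongly.
def sgd_step_with_wd(w, grad, lr=0.01, wd=1e-4):
    return w - lr * (grad + wd * w)
```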
That’s not the same thing, it’s a parameter that reduces the learning rate over time.
A larger batch will yield more accurate (less noisy) weight updates. The tradeoff is you get fewer weight updates per epoch. On the other hand, a small batch will yield noisier updates, but you get more updates per epoch.
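Quick back-of-the-envelope illustration (numbers made up):

```python
import math

dataset_size = 10_000                     # illustrative, not from the lesson
for batch_size in (16, 64, 256):
    updates = math.ceil(dataset_size / batch_size)
    print(f"batch_size={batch_size:>3} -> {updates} updates per epoch")
# Larger batches give fewer (but less noisy) updates per epoch; smaller batches the opposite.
```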
wd plays the same role in both the L1 and L2 regularization norms. The difference is in how the weights are summed (it’s this sum that gets multiplied by wd): L1 sums the absolute values of the weights, while L2 sums the squared values of the weights.
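In code, the two penalties look like this (a sketch, assuming `params` is an iterable of PyTorch tensors and `wd` is the weight decay coefficient):

```python
def l1_penalty(params, wd):
    # L1: wd * sum of the absolute values of the weights
    return wd * sum(p.abs().sum() for p in params)

def l2_penalty(params, wd):
    # L2: wd * sum of the squared weights
    return wd * sum((p ** 2).sum() for p in params)

# loss = data_loss + l1_penalty(model.parameters(), wd)   # L1 regularization
# loss = data_loss + l2_penalty(model.parameters(), wd)   # L2 / weight decay
```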
That’s learning rate decay, which is different from weight decay. Learning rate decay reduces the learning rate over several epochs. Weight decay affects the update of the parameters themselves during training.
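A toy example of the difference (all values made up, plain SGD on a single weight):

```python
w, grad = 1.0, 0.5        # one weight and its (fixed, pretend) gradient
base_lr, wd = 0.1, 1e-4

for epoch in range(3):
    lr = base_lr * (0.9 ** epoch)    # learning rate decay: the step size shrinks over epochs
    w = w - lr * (grad + wd * w)     # weight decay: every update also pulls w towards zero
    print(epoch, lr, w)
```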
Okay. Thanks for clarifying the difference.
Does something like “differential weight decay”, the way we do differential learning rates, make any sense?
Definitely. Leslie Smith is working on it, please check my previous post’s link.
nice! this is the type of explanation i was looking for!
“Fewer weight updates per epoch”, thus needing more epochs, right? As @sgugger said?
EDIT: punctuation
Thanks for clarifying the difference between weight decay and learning rate decay.