Lesson 5 In-Class Discussion ✅

Forcing weights towards zero encourages generalization. It keeps each of ten thousand different parameters from becoming specific to a different one of ten thousand different inputs in the training data.

4 Likes
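As a quick illustration of that pull towards zero, here is a minimal sketch (the learning rate and weight-decay values are made up for illustration, not course defaults):

```python
# Minimal sketch of why weight decay pulls a parameter toward zero.
# lr and wd here are illustrative values, not recommendations.
w, lr, wd = 5.0, 0.1, 0.1
for step in range(50):
    grad = 0.0                    # pretend the data gives no signal for this weight
    w = w - lr * (grad + wd * w)  # each step shrinks w by a factor of (1 - lr*wd)
print(w)  # ~3.0: the unused weight has decayed geometrically toward zero
```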

In the old fastai course we were using a weight decay of 1e-7. Is that too low for the latest version, or does it depend?

Because it has more capacity, and if it uses that capacity well it can learn more complex features that are useful in the real world. Not abusing this representational power means not using it to learn details that do not matter in the real world. That abuse is what we try to avoid with regularization.

1 Like

I wrote down that when Jeremy described weight decay as subtracting some constant from the weights, he said: “This is weight decay, not regularization.” Can you help explain why this is?

So if, for example, resnet34 is already overfitting before the regularization parameters have been tuned, might moving to resnet50 still be useful because it can learn better features?

1 Like

A bigger architecture has the advantage that it explores more complex models. Of course, more complex models are vulnerable to overfitting, but weight decay and other forms of regularization mitigate that. So you get to have your cake and eat it too!

Is there anywhere I can find a copy of that spreadsheet he’s using?

3 Likes

I think he must have said “this is weight decay, not L2 regularization.” The nuance is explained in the article I mentioned before.

1 Like
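To make that nuance concrete, here is a toy sketch (hypothetical function names, not fastai’s code). With plain SGD the two formulations give the same update; with an adaptive optimizer like Adam the L2 term gets rescaled by the running gradient statistics, while true weight decay is applied to the weights directly, which is the point of the AdamW paper.

```python
import torch

def step_l2(w, grad, lr, wd):
    # "L2 regularization": the penalty's gradient wd*w is folded into the gradient
    return w - lr * (grad + wd * w)

def step_weight_decay(w, grad, lr, wd):
    # "weight decay": take the ordinary gradient step, then shrink the weights directly
    return w - lr * grad - lr * wd * w

w = torch.tensor([1.0, -2.0])
g = torch.tensor([0.1, 0.1])
# Identical for plain SGD; they diverge once the optimizer rescales gradients (Adam vs AdamW).
assert torch.allclose(step_l2(w, g, 0.1, 0.01), step_weight_decay(w, g, 0.1, 0.01))
```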

Moving to resnet50 will probably increase performance since weight decay is applied to each weight and is independent of the size of the model. You will probably still be overfitting and benefit from regularization. If you try it, let me know your results!

1 Like

Thanks a lot for the answers 🙂

2 Likes

Could forward and backward selection be another way of doing dimensionality reduction, assuming the problem is a regression problem?

How does an nn.Module, where we pass bias=True, get the bias elements for its layer?

This is answered higher in this thread.
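For quick reference, here is a simplified sketch of what a layer like nn.Linear does when you pass bias=True (not the actual PyTorch source, which also uses a different initialization). Registering a tensor as an nn.Parameter is what makes it show up in the module’s parameters and receive gradients.

```python
import torch
import torch.nn as nn

class TinyLinear(nn.Module):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        # bias=True just means one extra parameter vector of shape (out_features,)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        out = x @ self.weight.t()
        return out if self.bias is None else out + self.bias

layer = TinyLinear(10, 3, bias=True)
print([name for name, _ in layer.named_parameters()])  # ['weight', 'bias']
```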

By imposing a restriction on the space of allowed weights, L2 (or L1) regularization limits the flexibility of the model to fit the nooks and crannies of an individual data set, so it helps avoid overfitting. Isn’t that right?
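Right. Written out, the penalized objective and the “restricted weight space” view are two sides of the same coin, connected through a Lagrange multiplier: a larger λ corresponds to a smaller allowed radius C.

```latex
% Penalized form: larger \lambda punishes large weights more strongly
\min_{w} \; L(w) + \lambda \lVert w \rVert_2^2
% Constrained form: keep the weights inside a ball of radius \sqrt{C}
\min_{w} \; L(w) \quad \text{subject to} \quad \lVert w \rVert_2^2 \le C
```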

How many people are in the room with Jeremy?

We’ve noticed the number of viewers on the live stream has dropped a lot since the first and second lectures.

I think Jeremy will make the Excel spreadsheet available after the class in the course-v3 repo, just like the previous ones: https://github.com/fastai/course-v3/tree/master/files/xl

2 Likes

Can’t wrap my head around how it is that learning rate decay, momentum, and fit_one_cycle work together rather than interfere with each other.

2 Likes
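They’re designed not to interfere: fit_one_cycle schedules the learning rate and the momentum jointly, ramping the learning rate up to its maximum and back down over the batches while the momentum moves in the opposite direction. Here is a rough, illustrative sketch of that shape (plain numpy, not fastai’s actual scheduler; the specific start, end, and momentum values are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def cos_interp(start, end, n):
    # smooth interpolation from start to end (one half-period of a cosine)
    t = np.linspace(0, np.pi, n)
    return end + (start - end) * (1 + np.cos(t)) / 2

n_iter, pct_start = 1000, 0.3             # total batches; fraction spent ramping up
n_up = int(n_iter * pct_start)
max_lr = 1e-2                              # the peak value you would pass to fit_one_cycle
lr  = np.concatenate([cos_interp(max_lr / 25, max_lr, n_up),              # warm up
                      cos_interp(max_lr, max_lr / 1000, n_iter - n_up)])  # anneal down
mom = np.concatenate([cos_interp(0.95, 0.85, n_up),                       # momentum dips...
                      cos_interp(0.85, 0.95, n_iter - n_up)])             # ...while lr is high

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(lr);  ax1.set_xlabel('batch index'); ax1.set_title('learning rate')
ax2.plot(mom); ax2.set_xlabel('batch index'); ax2.set_title('momentum')
plt.show()
```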

Question on the fit_one_cycle plots:
at what value does the learning rate start and end, and what is the value at the max point?

This explanation answers your question!

Sorry for repeating my question, but does the learning rate plot show how the rate changes over the whole training process? It was mentioned that the batch index is shown on the x-axis. Though probably I just need to read the paper 😄