Lesson 5 In-Class Discussion ✅

Forcing weights towards zero encourages generalization. It keeps each of ten thousand different parameters from becoming specific to a different one of ten thousand different inputs in the training data.

4 Likes
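As a quick illustration of that pull towards zero, here is a minimal sketch (the learning rate and weight-decay values are made up for illustration, not course defaults):

```python
# Minimal sketch of why weight decay pulls a parameter toward zero.
# lr and wd here are illustrative values, not recommendations.
w, lr, wd = 5.0, 0.1, 0.1
for step in range(50):
    grad = 0.0                    # pretend the data gives no signal for this weight
    w = w - lr * (grad + wd * w)  # each step shrinks w by a factor of (1 - lr*wd)
print(w)  # ~3.0: the unused weight has decayed geometrically toward zero
```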

In the old fastai course we were using a weight decay of 1e-7. Is that too low for the latest version, or does it depend?

Because it has more capacity, and if it uses that capacity well it can learn more complex features that are useful in the real world. Not abusing this representational power means not using it to learn details that do not matter in the real world. That abuse is what we try to avoid with regularization.

1 Like

I wrote down that when Jeremy described weight decay as subtracting some constant from the weights, he said: “This is weight decay, not regularization.” Can you help explain why this is?

So if, for example, resnet34 is already overfitting before the regularization parameters have been tuned, might moving to resnet50 still be useful because it can learn better features?

1 Like

A bigger architecture has the advantage that it explores more complex models. Of course, more complex models are vulnerable to overfitting, but weight decay and other forms of regularization mitigate that. So you get to have your cake and eat it too!

Is there anywhere I can find a copy of that spreadsheet he’s using?

3 Likes

I think he must have said “this is weight decay, not L2 regularization.” The nuance is explained in the article I mentioned before.

1 Like
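To make that nuance concrete, here is a toy sketch (hypothetical function names, not fastai’s code). With plain SGD the two formulations give the same update; with an adaptive optimizer like Adam the L2 term gets rescaled by the running gradient statistics, while true weight decay is applied to the weights directly, which is the point of the AdamW paper.

```python
import torch

def step_l2(w, grad, lr, wd):
    # "L2 regularization": the penalty's gradient wd*w is folded into the gradient
    return w - lr * (grad + wd * w)

def step_weight_decay(w, grad, lr, wd):
    # "weight decay": take the ordinary gradient step, then shrink the weights directly
    return w - lr * grad - lr * wd * w

w = torch.tensor([1.0, -2.0])
g = torch.tensor([0.1, 0.1])
# Identical for plain SGD; they diverge once the optimizer rescales gradients (Adam vs AdamW).
assert torch.allclose(step_l2(w, g, 0.1, 0.01), step_weight_decay(w, g, 0.1, 0.01))
```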

Moving to resnet50 will probably increase performance since weight decay is applied to each weight and is independent of the size of the model. You will probably still be overfitting and benefit from regularization. If you try it, let me know your results!

1 Like

Thanks a lot for the answers 🙂

2 Likes

Could forward and backward selection be another way of doing dimensionality reduction, assuming the problem is a regression problem?

How does an nn.Module, where we pass bias=True, get the bias elements for its layer?

This is answered higher in this thread.
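For quick reference, here is a simplified sketch of what a layer like nn.Linear does when you pass bias=True (not the actual PyTorch source, which also uses a different initialization). Registering a tensor as an nn.Parameter is what makes it show up in the module’s parameters and receive gradients.

```python
import torch
import torch.nn as nn

class TinyLinear(nn.Module):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        # bias=True just means one extra parameter vector of shape (out_features,)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        out = x @ self.weight.t()
        return out if self.bias is None else out + self.bias

layer = TinyLinear(10, 3, bias=True)
print([name for name, _ in layer.named_parameters()])  # ['weight', 'bias']
```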

By imposing a restriction on the space of allowed weights, L2 (or L1) regularization limits the flexibility of the model to fit the nooks and crannies of an individual data set, so it helps avoid overfitting. Isn’t that right?
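Right. Written out, the penalized objective and the “restricted weight space” view are two sides of the same coin, connected through a Lagrange multiplier: a larger λ corresponds to a smaller allowed radius C.

```latex
% Penalized form: larger \lambda punishes large weights more strongly
\min_{w} \; L(w) + \lambda \lVert w \rVert_2^2
% Constrained form: keep the weights inside a ball of radius \sqrt{C}
\min_{w} \; L(w) \quad \text{subject to} \quad \lVert w \rVert_2^2 \le C
```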

How many people are in the room with Jeremy?

We’ve noticed the number of viewers on the live stream has dropped a lot since the first and second lectures.

I think Jeremy will make the Excel spreadsheet available after the class in the course-v3 repo, just like the previous ones: https://github.com/fastai/course-v3/tree/master/files/xl

2 Likes

Can’t wrap my head around how it is that learning rate decay, momentum, and fit_one_cycle work together rather than interfere with each other.

2 Likes
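They’re designed not to interfere: fit_one_cycle schedules the learning rate and the momentum jointly, ramping the learning rate up to its maximum and back down over the batches while the momentum moves in the opposite direction. Here is a rough, illustrative sketch of that shape (plain numpy, not fastai’s actual scheduler; the specific start, end, and momentum values are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def cos_interp(start, end, n):
    # smooth interpolation from start to end (one half-period of a cosine)
    t = np.linspace(0, np.pi, n)
    return end + (start - end) * (1 + np.cos(t)) / 2

n_iter, pct_start = 1000, 0.3             # total batches; fraction spent ramping up
n_up = int(n_iter * pct_start)
max_lr = 1e-2                              # the peak value you would pass to fit_one_cycle
lr  = np.concatenate([cos_interp(max_lr / 25, max_lr, n_up),              # warm up
                      cos_interp(max_lr, max_lr / 1000, n_iter - n_up)])  # anneal down
mom = np.concatenate([cos_interp(0.95, 0.85, n_up),                       # momentum dips...
                      cos_interp(0.85, 0.95, n_iter - n_up)])             # ...while lr is high

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(lr);  ax1.set_xlabel('batch index'); ax1.set_title('learning rate')
ax2.plot(mom); ax2.set_xlabel('batch index'); ax2.set_title('momentum')
plt.show()
```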

Question on the fit_one_cycle plots:
at what value does the learning rate start and end, and what is the value at the max point?

This explanation answers your question!

Sorry for repeating my question, but does the learning rate plot show how the rate changes over the whole training process? It was mentioned that the batch index is shown on the x-axis. Though probably I just need to read the paper 😄