Lesson 5 In-Class Discussion ✅

So is there any way to transfer existing weights to a slightly different model (I mean a different architecture)?

Will the Adam optimizer actually produce a different answer, or is it just faster than RMSProp etc.?

What does zero_grad do, and what does no_grad do earlier in the update function?
Maybe it was answered at the beginning, but I lost that discussion…
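
For context, here is a rough sketch (not the exact lesson code) of the kind of update function being asked about, showing where no_grad and zero_grad fit in:

```python
import torch

# Rough sketch of a manual SGD update step (assumed shape, not the lesson's exact code).
def update(model, x, y, loss_fn, lr):
    loss = loss_fn(model(x), y)
    loss.backward()                     # fills p.grad for every parameter
    with torch.no_grad():               # don't let autograd track the update step itself
        for p in model.parameters():
            p -= lr * p.grad            # plain gradient-descent step
    model.zero_grad()                   # reset p.grad so gradients don't accumulate across batches
    return loss.item()
```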

It will be different; it’s a different equation.


Thanks for all of the knowledge tidbits tonight, Sylvain! Also, thanks to Jeremy for another incredible lecture.


Thank you fast.ai team.


I just wanted to note that the fast.ai Intro to ML course goes into more detail on feature engineering for the Rossman competition.


So is it potentially a good idea to overfit a new model on a small amount of data at first (since it’s much faster to run)?

Can we run the notebooks for the ML course in the fastai GitHub with the current version of the library, or are they only compatible with v0.7?


In the momentum process we also take the previous step into account…
So when we overshoot the minimum, how do we tend to start switching back when using momentum?
Can someone who understood this well please explain?
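
A toy illustration (assumed values, not from the lesson) of one common momentum formulation on f(x) = x², showing how the switch back happens: once we overshoot past the minimum, the gradient flips sign, so it starts eating into the accumulated velocity until the steps turn around.

```python
# Heavy-ball style momentum on f(x) = x^2 (gradient is 2x). Hypothetical settings.
lr, beta = 0.2, 0.8
x, v = 2.0, 0.0
for i in range(10):
    grad = 2 * x            # after overshooting past 0, grad changes sign
    v = beta * v + grad     # the opposing gradient gradually cancels the velocity
    x = x - lr * v          # steps swing back, oscillating with shrinking amplitude
    print(i, round(x, 3))
```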

The regularization coefficient is the multiplier of the L2 or L1 sum. If you are doing L2 regularization, it’s the multiplier of the sum of squared weights. If you are doing L1 regularization, it’s the multiplier of the sum of the absolute values of the weights.
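
In PyTorch-style pseudocode (an assumed sketch, not the lesson’s code), the coefficient `wd` is what multiplies the penalty term added to the loss:

```python
# Sketch: `wd` is the regularization coefficient multiplying the penalty.
def penalized_loss(loss, parameters, wd, kind="l2"):
    if kind == "l2":
        penalty = sum((p ** 2).sum() for p in parameters)   # sum of squared weights
    else:
        penalty = sum(p.abs().sum() for p in parameters)    # sum of absolute values (L1)
    return loss + wd * penalty
```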


Remember that the weights correspond to a specific architecture – the one that was used to train the model. It wouldn’t make sense to transfer these weights to another architecture – unless, perhaps, the new architecture contains the old one as a subset.
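
For the “subset” case, here is a minimal sketch of one way to do a partial transfer in PyTorch, assuming the new model reuses some layer names and shapes from the old architecture (`new_model` and `'old_model.pth'` are hypothetical):

```python
import torch

# Copy across only the parameters whose name and shape also exist in the new model.
old_state = torch.load('old_model.pth')
new_state = new_model.state_dict()

matched = {k: v for k, v in old_state.items()
           if k in new_state and v.shape == new_state[k].shape}
new_state.update(matched)
new_model.load_state_dict(new_state)   # layers with no match keep their fresh initialization
```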

I think Adam should just be faster. The results will differ slightly, but they are converging to the (hopefully) same optimum.
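
For reference, a condensed sketch of the standard published Adam update (not necessarily fastai’s exact implementation), which shows why its path differs from RMSProp’s: Adam also keeps a moving average of the gradient itself (m), not just of the squared gradient.

```python
# One Adam step for a single parameter p; t is the step count starting at 1.
def adam_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # EWMA of gradients (the momentum part)
    v = b2 * v + (1 - b2) * grad ** 2     # EWMA of squared gradients (the RMSProp part)
    m_hat = m / (1 - b1 ** t)             # bias correction for the zero-initialized EWMAs
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (v_hat ** 0.5 + eps)
    return p, m, v
```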

Not really. Imagine you have a dataset of 120 samples and batches of size 8. Running an epoch on a small subset of the data (8 samples) gives the same weight update as running the first iteration (the first batch of 8) of an epoch over the full dataset. So training on a small fraction of your dataset and then loading those weights produces the same weights as training on the first mini-batch of your epoch and then continuing from there.


To add to your interesting conversation (@sgugger, @PierreO, @Jaghachi) about minibatch sizes, number of epochs/iterations and learning rates, I wanted to point out these two interesting papers that have researched many of these interrelations:

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (Facebook)

The goal/reason here was different, but one key finding was that when changing the batch size, the learning rate should be scaled linearly too, to get similar training effects (double the batch size, double the learning rate!), which would explain the effects that you observed, @Jaghachi.

Same thing (kind of inverted), but again with different reasoning here:
Don’t Decay the Learning Rate, Increase the Batch Size (Google Brain)

This paper finds that raising the batch size has a similar effect to lowering the learning rate at a fixed batch size. The authors argue that raising the batch size instead of lowering the learning rate also avoids some adverse effects of low learning rates.
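
Putting the rule from both papers into a tiny sketch (hypothetical helper, not a fastai API): scale the learning rate in proportion to the batch size.

```python
# Linear scaling rule: lr grows in proportion to batch size.
def scaled_lr(base_lr, base_bs, new_bs):
    return base_lr * new_bs / base_bs

# e.g. if lr=1e-2 worked at bs=64, try roughly 2e-2 at bs=128
print(scaled_lr(1e-2, 64, 128))   # 0.02
```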

Re: having a very large batch size and just running more epochs, as @PierreO suggested: my intuition says that it is better NOT to have the model look at the data too often, as this may lead to more overfitting, so running many epochs is often not a good idea. But then again the averaging across large batch sizes may cancel that out somewhat, so maybe someone has good research in this direction somewhere?

I have not seen these issues in the fastai notebooks and can’t remember if Jeremy talked about this somewhere. But these batch-size effects could again be something to automate within the fastai library. :wink:

In the instances where the batch size was changed (i.e. when unfreezing), Jeremy always ran the learning rate finder again and reset the learning rates to something different, so implicitly the lr was always changed when changing batch sizes…


Very interesting, I’ll check out those links. Thanks!

I understand that, but given the big boost from transfer learning, wouldn’t having a kind of vague mapping from one architecture to the other be kind of cool? I’ve read that “anything better than random is good” in this thread, so I thought that if we could do a very rough translation from one architecture to the other, it would make it easier to train new architectures. But maybe that’s just too ambitious? Or maybe it’s way easier to simply re-train the new architecture rather than try to do this.

Then again, I remember Jeremy saying that architecture is the last thing you want to change when trying to get a better model, so maybe it wouldn’t be that useful either?

You may find this interesting.

https://www.fast.ai/2018/08/10/fastai-diu-imagenet/

The fastai library can achieve the same result (using AWS) but 3x faster.


Thanks, I am aware of that. The conversation was not about training speed, where fastai beats this approach; rather, this paper also researches/explains the relationship between batch size and learning rate, which was relevant here and is still interesting. :wink:

Does the learning rate finder take batch size into account?