Lesson 5 In-Class Discussion ✅

So is there any way to transfer existing weights to a slightly different model (I mean a different architecture)?

Will the Adam optimizer actually produce a different answer, or is it just faster than RMSProp etc.?

What does zero_grad do, and what does no_grad do earlier in the update function?
Maybe it was answered at the beginning, but I lost that discussion…
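
For context, here is a rough sketch (not the exact lesson code) of the kind of update function being asked about, showing where no_grad and zero_grad fit in:

```python
import torch

# Rough sketch of a manual SGD update step (assumed shape, not the lesson's exact code).
def update(model, x, y, loss_fn, lr):
    loss = loss_fn(model(x), y)
    loss.backward()                     # fills p.grad for every parameter
    with torch.no_grad():               # don't let autograd track the update step itself
        for p in model.parameters():
            p -= lr * p.grad            # plain gradient-descent step
    model.zero_grad()                   # reset p.grad so gradients don't accumulate across batches
    return loss.item()
```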

It will be different; it’s a different equation.


Thanks for all of the knowledge tidbits tonight, Sylvain! Also, thanks to Jeremy for another incredible lecture.


Thank you fast.ai team.


I just wanted to note that the fast.ai Intro to ML course goes into more detail on feature engineering for the Rossman competition.


So is it potentially a good idea to overfit a new model on a small amount of data at first (since it’s much faster to run)?

Can we run the notebooks for the ML course in the fastai GitHub with the current version of the library, or are they only compatible with v0.7?


In the momentum process we also take the previous step into account…
So when we overshoot the minimum, how do we tend to start switching back when using momentum?
Can someone who understood this well please explain?
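
A toy illustration (assumed values, not from the lesson) of one common momentum formulation on f(x) = x², showing how the switch back happens: once we overshoot past the minimum, the gradient flips sign, so it starts eating into the accumulated velocity until the steps turn around.

```python
# Heavy-ball style momentum on f(x) = x^2 (gradient is 2x). Hypothetical settings.
lr, beta = 0.2, 0.8
x, v = 2.0, 0.0
for i in range(10):
    grad = 2 * x            # after overshooting past 0, grad changes sign
    v = beta * v + grad     # the opposing gradient gradually cancels the velocity
    x = x - lr * v          # steps swing back, oscillating with shrinking amplitude
    print(i, round(x, 3))
```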

The regularization coefficient is the multiplier of the L2 or L1 sum. If you are doing L2 regularization, it’s the multiplier of the sum of squared weights. If you are doing L1 regularization, it’s the multiplier of the sum of the absolute values of the weights.
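
In PyTorch-style pseudocode (an assumed sketch, not the lesson’s code), the coefficient `wd` is what multiplies the penalty term added to the loss:

```python
# Sketch: `wd` is the regularization coefficient multiplying the penalty.
def penalized_loss(loss, parameters, wd, kind="l2"):
    if kind == "l2":
        penalty = sum((p ** 2).sum() for p in parameters)   # sum of squared weights
    else:
        penalty = sum(p.abs().sum() for p in parameters)    # sum of absolute values (L1)
    return loss + wd * penalty
```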


Remember that the weights correspond to a specific architecture – the one that was used to train the model. It wouldn’t make sense to transfer these weights to another architecture – unless, perhaps, the new architecture contains the old one as a subset.
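
For the “subset” case, here is a minimal sketch of one way to do a partial transfer in PyTorch, assuming the new model reuses some layer names and shapes from the old architecture (`new_model` and `'old_model.pth'` are hypothetical):

```python
import torch

# Copy across only the parameters whose name and shape also exist in the new model.
old_state = torch.load('old_model.pth')
new_state = new_model.state_dict()

matched = {k: v for k, v in old_state.items()
           if k in new_state and v.shape == new_state[k].shape}
new_state.update(matched)
new_model.load_state_dict(new_state)   # layers with no match keep their fresh initialization
```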

I think Adam should just be faster. The results will differ slightly, but they are converging to the (hopefully) same optimum.
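
For reference, a condensed sketch of the standard published Adam update (not necessarily fastai’s exact implementation), which shows why its path differs from RMSProp’s: Adam also keeps a moving average of the gradient itself (m), not just of the squared gradient.

```python
# One Adam step for a single parameter p; t is the step count starting at 1.
def adam_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # EWMA of gradients (the momentum part)
    v = b2 * v + (1 - b2) * grad ** 2     # EWMA of squared gradients (the RMSProp part)
    m_hat = m / (1 - b1 ** t)             # bias correction for the zero-initialized EWMAs
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (v_hat ** 0.5 + eps)
    return p, m, v
```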

Not really. Imagine you have a dataset of 120 samples and batches of size 8. Running an epoch on a small subset of the data (8 samples) gives the same weight update as running the first iteration (the first batch of 8) of an epoch over the full dataset. So training on a small fraction of your dataset and then loading those weights produces the same weights as training on the first mini-batch of your epoch and then continuing from there.


To add to your interesting conversation (@sgugger, @PierreO, @Jaghachi) about minibatch sizes, number of epochs/iterations and learning rates, I wanted to point out these two interesting papers that have researched many of these interrelations:

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (Facebook)

The goal/reason here was different, but one key finding was that when changing the batch size, the learning rate should be scaled linearly too, to get similar training effects (double the batch size, double the learning rate!), which would explain the effects that you observed, @Jaghachi.

Same thing (kind of inverted), but again with different reasoning here:
Don’t Decay the Learning Rate, Increase the Batch Size (Google Brain)

This paper finds that raising the batch size has a similar effect to lowering the learning rate at a fixed batch size. The authors argue that raising the batch size instead of lowering the learning rate also avoids some adverse effects of low learning rates.
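
Putting the rule from both papers into a tiny sketch (hypothetical helper, not a fastai API): scale the learning rate in proportion to the batch size.

```python
# Linear scaling rule: lr grows in proportion to batch size.
def scaled_lr(base_lr, base_bs, new_bs):
    return base_lr * new_bs / base_bs

# e.g. if lr=1e-2 worked at bs=64, try roughly 2e-2 at bs=128
print(scaled_lr(1e-2, 64, 128))   # 0.02
```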

Re: having a very large batch size and just running more epochs, as @PierreO suggested: my intuition says that it is better NOT to have the model look at the data too often, as this may lead to more overfitting, so running many epochs is often not a good idea. But then again the averaging across large batch sizes may cancel that out somewhat, so maybe someone has good research in this direction somewhere?

I have not seen these issues in the fastai notebooks and can’t remember if Jeremy talked about this somewhere. But these batch-size effects could again be something to automate within the fastai library. :wink:

In the instances where the batch size was changed (i.e. when unfreezing), Jeremy always ran the learning rate finder again and reset the learning rates to something different, so implicitly the lr was always changed when changing batch sizes…


Very interesting, I’ll check out those links. Thanks!

I understand that, but given the big boost from transfer learning, wouldn’t having a kind of vague mapping from one architecture to the other be kind of cool? I’ve read that “anything better than random is good” in this thread, so I thought that if we could do a very rough translation from one architecture to the other, it would make it easier to train new architectures. But maybe that’s just too ambitious? Or maybe it’s way easier to simply re-train the new architecture rather than try to do this.

Then again, I remember Jeremy saying that architecture is the last thing you want to change when trying to get a better model, so maybe it wouldn’t be that useful either?

You may find this interesting.

https://www.fast.ai/2018/08/10/fastai-diu-imagenet/

The fastai library can achieve the same result (using AWS) but 3x faster.


Thanks, I am aware of that. The conversation was not about training speed, where fastai beats this approach; rather, this paper also researches/explains the relationship between batch size and learning rate, which was relevant here and is still interesting. :wink:

Does the learning rate finder take batch size into account?