For those last added weights, does it matter how many are added? Is the initialization random, and is there a benefit to adding more layers of weights versus fewer?
The thing is that you don't need to know the ideal learning rate to that degree of precision. A rough value at the right order of magnitude is more than enough to efficiently train your model.
In my experience it is not necessary.
Depending on how long I train, I generally run lr_find 2-3 times per training (the first at the very beginning).
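For intuition about what lr_find does under the hood, here is a minimal pure-Python sketch of the LR range test: train while growing the learning rate exponentially, record the loss, and stop once it explodes. A toy quadratic loss stands in for a real model; all names here are illustrative, not fastai's API:

```python
def lr_find(loss, grad, w=0.0, lr_min=1e-6, lr_max=10.0, steps=100):
    """Toy LR range test: one SGD step per lr on an exponential sweep,
    recording the loss, stopping once the loss blows up."""
    lrs, losses = [], []
    for i in range(steps):
        lr = lr_min * (lr_max / lr_min) ** (i / (steps - 1))
        lrs.append(lr)
        losses.append(loss(w))
        w -= lr * grad(w)                # one SGD step at this lr
        if losses[-1] > 10 * losses[0]:  # loss exploded: stop the sweep
            break
    return lrs, losses

# toy quadratic "model": loss is minimised at w = 3
lrs, losses = lr_find(lambda w: (w - 3.0) ** 2, lambda w: 2 * (w - 3.0))

# a common suggestion heuristic: the lr at which the loss dropped fastest
drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
suggested = lrs[max(range(len(drops)), key=drops.__getitem__)]
```

In fastai you would just call `learn.lr_find()` and read the suggestion off the plot; the point is only that the suggestion is a heuristic over a noisy curve, which is why an order-of-magnitude estimate is all you need.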
Why base_lr/2? Why not another fraction?
What does it mean when the lr_finder plot no longer looks like that? I've seen nearly flat plots.
Does that mean the model is trained?
Yes, the added layers are randomly initialized. In our experience, you don't need that many new layers, since the body of the network has already learned so much.
In the learner.fine_tune() method, why do we not call the learning rate finder again and take min/10 as the learning rate, rather than lr/2?
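For context, fastai's fine_tune roughly does one frozen cycle at base_lr, halves base_lr, then runs an unfrozen cycle with discriminative learning rates. Here is a runnable paraphrase of that call sequence; the MockLearner class is invented purely for illustration, and only the order of calls mirrors the real method:

```python
class MockLearner:
    """Hypothetical stand-in for a fastai Learner that just records calls."""
    def __init__(self):
        self.calls = []
    def freeze(self):
        self.calls.append("freeze")
    def unfreeze(self):
        self.calls.append("unfreeze")
    def fit_one_cycle(self, epochs, lr):
        self.calls.append(("fit_one_cycle", epochs, lr))

def fine_tune(learn, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100):
    # paraphrase of the call sequence in Learner.fine_tune
    learn.freeze()                               # 1. train only the new head
    learn.fit_one_cycle(freeze_epochs, base_lr)
    base_lr /= 2                                 # 2. the model is partly trained now,
    learn.unfreeze()                             #    so be a bit more conservative
    learn.fit_one_cycle(epochs, (base_lr / lr_mult, base_lr))  # discriminative lrs

learn = MockLearner()
fine_tune(learn, epochs=2)
```

The halving is just a conservative default: after the frozen stage the model is no longer randomly initialized, so a somewhat smaller rate is safer without needing a second lr_find.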
For all those defaults, just try whatever you want to experiment with, and see if you get better results.
I'm glad Jeremy is talking about the two different shapes for lr_find. I found last year a lot of people got confused by the second shape for a pretrained network.
Yes, it means the model has already learned things on the task at hand and does not have a randomly initialized part.
When we freeze layers during fine_tune, do we still obtain the gradients from the frozen layers but just not change them, with the gradients simply forwarded to the following layers, which are responsible for learning? Am I correct? Correct me if I'm wrong.
Because in practice, we didn't find we need it. This is a general method to quickly do transfer learning on a new dataset and get a very good baseline. You can always do it in two stages with two lr_finds and see if you get better results.
Sorry, again on lr_find: was any effort spent on finding a robust "automatic" way of selecting a (possibly conservative) LR? When I try to use fastai in non-interactive jobs, that would be very useful.
Thank you
Freezing means you only compute the gradients for part of the model (in this case the last layers). You don't even compute the gradients for the rest of the model, let alone update the corresponding parameters.
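A minimal pure-Python sketch of this: the frozen body still runs forward (its activations are needed), but no gradient is ever computed for its weights; only the head's gradient is computed and applied. All names here are illustrative:

```python
def train_step(x, target, body_w, head_w, lr=0.1):
    """One step with the body frozen: the forward pass runs through
    everything, but only the head's gradient is computed and applied."""
    h = body_w * x                    # forward through the frozen body
    pred = head_w * h                 # forward through the trainable head
    d_head = 2 * (pred - target) * h  # dL/d(head_w) for L = (pred - target)**2
    # note: dL/d(body_w) is never computed -- freezing skips it entirely
    return body_w, head_w - lr * d_head

body_w, head_w = 2.0, 0.0
for _ in range(50):
    body_w, head_w = train_step(x=1.0, target=4.0,
                                body_w=body_w, head_w=head_w, lr=0.05)
```

Since the frozen layers sit before the trainable head, backprop stops at the head: nothing is "forwarded" to the frozen layers, their gradients are simply never computed.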
Yes. That's why you have suggested values.
Is
learn.fit_one_cycle(3, lr=1e-3)
learn.fit_one_cycle(7, lr=1e-3)
the same as
learn.fit_one_cycle(10, lr=1e-3)?
Absolutely not, that is what Jeremy is explaining right now.
Is there something special that needs to be done to create a new random learner? I feel like I've had trouble before when running and training a learner and then going back to re-define a new learner and re-fit. I don't have any current examples where I'm having issues, though, so it's possible I was just not doing quite what I thought I was.
If you train for 50 or 100 epochs, will your model generalize better on brand-new data it has not seen before? In other words, you'll have a very long tail, but the model will still be improving?