For those last added weights, does it matter how many are added? Is the initialization random, and is there a benefit to adding more layers of weights versus fewer?
The thing is that you don't need to know the ideal learning rate to that degree of precision. A rough value at the right order of magnitude is more than enough to efficiently train your model.
In my experience it is not necessary.
Depending on how long I train, I generally run lr_find 2-3 times per training (the first at the very beginning).
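For intuition about what lr_find does under the hood, here is a minimal pure-Python sketch of the LR range test: train while growing the learning rate exponentially, record the loss, and stop once it explodes. A toy quadratic loss stands in for a real model; all names here are illustrative, not fastai's API:

```python
def lr_find(loss, grad, w=0.0, lr_min=1e-6, lr_max=10.0, steps=100):
    """Toy LR range test: one SGD step per lr on an exponential sweep,
    recording the loss, stopping once the loss blows up."""
    lrs, losses = [], []
    for i in range(steps):
        lr = lr_min * (lr_max / lr_min) ** (i / (steps - 1))
        lrs.append(lr)
        losses.append(loss(w))
        w -= lr * grad(w)                # one SGD step at this lr
        if losses[-1] > 10 * losses[0]:  # loss exploded: stop the sweep
            break
    return lrs, losses

# toy quadratic "model": loss is minimised at w = 3
lrs, losses = lr_find(lambda w: (w - 3.0) ** 2, lambda w: 2 * (w - 3.0))

# a common suggestion heuristic: the lr at which the loss dropped fastest
drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
suggested = lrs[max(range(len(drops)), key=drops.__getitem__)]
```

In fastai you would just call `learn.lr_find()` and read the suggestion off the plot; the point is only that the suggestion is a heuristic over a noisy curve, which is why an order-of-magnitude estimate is all you need.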
Why base_lr/2? Why not another fraction?
What does it mean when the lr_finder plot no longer looks like that? I've seen nearly flat plots.
Does that mean the model is trained?
Yes, the added layers are randomly initialized. In our experience, you don't need that many new layers, since the body of the network has already learned so much.
In the learner.fine_tune() method, why do we not call the learning rate finder again and take min/10 as the learning rate, rather than lr/2?
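For context, fastai's fine_tune roughly does one frozen cycle at base_lr, halves base_lr, then runs an unfrozen cycle with discriminative learning rates. Here is a runnable paraphrase of that call sequence; the MockLearner class is invented purely for illustration, and only the order of calls mirrors the real method:

```python
class MockLearner:
    """Hypothetical stand-in for a fastai Learner that just records calls."""
    def __init__(self):
        self.calls = []
    def freeze(self):
        self.calls.append("freeze")
    def unfreeze(self):
        self.calls.append("unfreeze")
    def fit_one_cycle(self, epochs, lr):
        self.calls.append(("fit_one_cycle", epochs, lr))

def fine_tune(learn, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100):
    # paraphrase of the call sequence in Learner.fine_tune
    learn.freeze()                               # 1. train only the new head
    learn.fit_one_cycle(freeze_epochs, base_lr)
    base_lr /= 2                                 # 2. the model is partly trained now,
    learn.unfreeze()                             #    so be a bit more conservative
    learn.fit_one_cycle(epochs, (base_lr / lr_mult, base_lr))  # discriminative lrs

learn = MockLearner()
fine_tune(learn, epochs=2)
```

The halving is just a conservative default: after the frozen stage the model is no longer randomly initialized, so a somewhat smaller rate is safer without needing a second lr_find.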
For all those defaults, just try whatever you want to experiment with, and see if you get better results.
I'm glad Jeremy is talking about the two different shapes for lr_find. I found last year a lot of people got confused by the second shape for a pretrained network.
Yes, it means the model has already learned things on the task at hand and does not have a randomly initialized part.
When we freeze layers during fine_tune, do we still obtain the gradients from the frozen layers but just not change them, with the gradients simply forwarded to the following layers, which are responsible for learning? Am I correct? Correct me if I'm wrong.
Because in practice, we didn't find we need it. This is a general method to quickly do transfer learning on a new dataset and get a very good baseline. You can always do it in two stages with two lr_finds and see if you get better results.
Sorry, again on lr_find: was any effort spent on finding a robust "automatic" way of selecting a (possibly conservative) LR? When I try to use fastai in non-interactive jobs, that would be very useful.
Thank you
Freezing means you only compute the gradients for part of the model (in this case the last layers). You don't even compute the gradients for the rest of the model, let alone update the corresponding parameters.
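A minimal pure-Python sketch of this: the frozen body still runs forward (its activations are needed), but no gradient is ever computed for its weights; only the head's gradient is computed and applied. All names here are illustrative:

```python
def train_step(x, target, body_w, head_w, lr=0.1):
    """One step with the body frozen: the forward pass runs through
    everything, but only the head's gradient is computed and applied."""
    h = body_w * x                    # forward through the frozen body
    pred = head_w * h                 # forward through the trainable head
    d_head = 2 * (pred - target) * h  # dL/d(head_w) for L = (pred - target)**2
    # note: dL/d(body_w) is never computed -- freezing skips it entirely
    return body_w, head_w - lr * d_head

body_w, head_w = 2.0, 0.0
for _ in range(50):
    body_w, head_w = train_step(x=1.0, target=4.0,
                                body_w=body_w, head_w=head_w, lr=0.05)
```

Since the frozen layers sit before the trainable head, backprop stops at the head: nothing is "forwarded" to the frozen layers, their gradients are simply never computed.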
Yes. That's why you have suggested values.
Is
learn.fit_one_cycle(3, lr=1e-3)
learn.fit_one_cycle(7, lr=1e-3)
the same as
learn.fit_one_cycle(10, lr=1e-3)?
Absolutely not, that is what Jeremy is explaining right now.
Is there something special that needs to be done to create a new random learner? I feel like I've had trouble before when running and training a learner and then going back to re-define a new learner and re-fit. I don't have any current examples where I'm having issues, though, so it's possible I was just not doing quite what I thought I was.
If you train for 50 or 100 epochs, will your model generalize better on brand-new data it has not seen before? In other words, you'll have a very long tail, but the model will still be improving?