Hi fastai community
I am struggling to understand the difference between the hyperparameters specified in the optimizer vs. those specified in the `Learner`, and am looking for some help/direction.
Let’s take the basic SGD with momentum optimizer as an example:
`SGD(params, lr, mom=0.0, wd=0.0, decouple_wd=True)`
and let's assume we experiment with setting the parameters of this optimizer function via a `partial`, as seen below:
`partial(SGD, lr=1e-5, mom=0.8, wd=0.01)`
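To make sure we're talking about the same thing, here is a minimal sketch of that setup (the imports are mine; `fastai.optimizer.SGD` is the function quoted above):

```python
from functools import partial
from fastai.optimizer import SGD

# Pre-bind the hyperparameters. The result is still a function that,
# like SGD itself, takes the model's parameters and returns an Optimizer.
opt_func = partial(SGD, lr=1e-5, mom=0.8, wd=0.01)
```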
We know that the `Learner` class has overlapping parameters with defaults of its own:
`Learner(dls, ..., opt_func=Adam, lr=0.001, ..., wd=None, ..., moms=(0.95, 0.85, 0.95))`
So in this experiment, we pass the `partial` function to the `Learner` and end up with something like:
`Learner(dls, ..., opt_func=partial(SGD, lr=1e-5, mom=0.8, wd=0.01), lr=0.001, ..., wd=None, ..., moms=(0.95, 0.85, 0.95))`
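For reference, here is a small runnable version of this setup that I have been poking at (the dataset and model are arbitrary choices on my part, just to have something concrete):

```python
from functools import partial
from fastai.vision.all import *

path = untar_data(URLs.MNIST_SAMPLE)
dls = ImageDataLoaders.from_folder(path)

learn = Learner(dls, resnet18(num_classes=2),
                opt_func=partial(SGD, lr=1e-5, mom=0.8, wd=0.01),
                loss_func=CrossEntropyLossFlat(), metrics=accuracy)

# The optimizer is built lazily, so force its creation and inspect
# the hyperparameters it actually holds (one dict per param group):
learn.create_opt()
print(learn.opt.hypers)
```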
My questions are:
- Which `wd` gets overwritten (i.e. which `wd` has the higher priority)? Or do they have different and distinct purposes? (I tried to probe this empirically; see the snippet after this list.)
- Why do we specify a single `mom` to the optimizer, but a tuple of `moms` to the `Learner`?
- Although I haven’t mentioned it above, one further area of confusion (that I think is related) is optimizers that have adaptive learning rates, such as `Adam`. I have read that Adam “computes individual learning rates for different parameters”; if that is the case, why do we specify the `lr` in the `Learner` class at all?
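Here is the probe I mentioned in the first question, reusing the `Learner` from above. It prints the optimizer's `lr` and `wd` before and after a plain `fit` call, though I am not confident I am interpreting the result correctly:

```python
# Does the lr/wd baked into opt_func survive a call to fit()?
learn.create_opt()
print('before fit:', learn.opt.hypers[0]['lr'], learn.opt.hypers[0]['wd'])

learn.fit(1)  # no lr/wd passed, so presumably the Learner defaults apply

print('after fit: ', learn.opt.hypers[0]['lr'], learn.opt.hypers[0]['wd'])
```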
It is entirely possible I have overlooked a fundamental principle about optimizers and misunderstand them. There is clearly a reason you can specify these params in multiple places, yet when I look at the source code I cannot find where the `Learner` incorporates the optimizer's parameters.
Many thanks for any input or resources you may have on this subject.