Hi fastai community
I am struggling to understand the difference between the hyperparameters specified in the optimizer vs. those specified in the `Learner`, and am looking for some help/direction.
Let’s take the basic SGD with momentum optimizer as an example:
`SGD(params, lr, mom=0.0, wd=0.0, decouple_wd=True)`
and let's assume we experiment with setting the parameters of this optimizer function via a `partial`, as seen below:
`partial(SGD, lr=1e-5, mom=0.8, wd=0.01)`
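To make sure we're talking about the same thing, here is a minimal sketch of that setup (the imports are mine; `fastai.optimizer.SGD` is the function quoted above):

```python
from functools import partial
from fastai.optimizer import SGD

# Pre-bind the hyperparameters. The result is still a function that,
# like SGD itself, takes the model's parameters and returns an Optimizer.
opt_func = partial(SGD, lr=1e-5, mom=0.8, wd=0.01)
```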
We know that the `Learner` class has overlapping parameters with defaults of its own:
`Learner(dls, ..., opt_func=Adam, lr=0.001, ..., wd=None, ..., moms=(0.95, 0.85, 0.95))`
So in this experiment, we pass the `partial` function to the `Learner` and end up with something like:
`Learner(dls, ..., opt_func=partial(SGD, lr=1e-5, mom=0.8, wd=0.01), lr=0.001, ..., wd=None, ..., moms=(0.95, 0.85, 0.95))`
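For reference, here is a small runnable version of this setup that I have been poking at (the dataset and model are arbitrary choices on my part, just to have something concrete):

```python
from functools import partial
from fastai.vision.all import *

path = untar_data(URLs.MNIST_SAMPLE)
dls = ImageDataLoaders.from_folder(path)

learn = Learner(dls, resnet18(num_classes=2),
                opt_func=partial(SGD, lr=1e-5, mom=0.8, wd=0.01),
                loss_func=CrossEntropyLossFlat(), metrics=accuracy)

# The optimizer is built lazily, so force its creation and inspect
# the hyperparameters it actually holds (one dict per param group):
learn.create_opt()
print(learn.opt.hypers)
```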
My questions are:
- Which `wd` gets overwritten (i.e. which `wd` has the higher priority)? Or do they have different and distinct purposes? (I tried to probe this empirically; see the snippet after this list.)
- Why do we specify a single `mom` to the optimizer, but a tuple of `moms` to the `Learner`?
- Although I haven’t mentioned it above, one further area of confusion (that I think is related) is optimizers that have adaptive learning rates, such as `Adam`. I have read that Adam “computes individual learning rates for different parameters”; if that is the case, why do we specify the `lr` in the `Learner` class at all?
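Here is the probe I mentioned in the first question, reusing the `Learner` from above. It prints the optimizer's `lr` and `wd` before and after a plain `fit` call, though I am not confident I am interpreting the result correctly:

```python
# Does the lr/wd baked into opt_func survive a call to fit()?
learn.create_opt()
print('before fit:', learn.opt.hypers[0]['lr'], learn.opt.hypers[0]['wd'])

learn.fit(1)  # no lr/wd passed, so presumably the Learner defaults apply

print('after fit: ', learn.opt.hypers[0]['lr'], learn.opt.hypers[0]['wd'])
```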
It is entirely possible I have overlooked a fundamental principle about optimizers and misunderstand them. There is clearly a reason you can specify these params in multiple places, yet when I look at the source code I cannot find where the `Learner` incorporates the optimizer's parameters.
Many thanks for any input or resources you may have on this subject.