Hi,
I have a perhaps dumb question: when training a model, we need to choose an optimizer such as optim.Adam. If, during training, we also use cosine annealing to schedule the learning rate (by supplying a cycle_len parameter), aren't we effectively overwriting the learning rate computed by the optimizer? If that is the case, does it still matter which optimizer we choose?
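For context, this is roughly the setup I have in mind, written with plain PyTorch's CosineAnnealingLR instead of the cycle_len argument (the model, data, and epoch count are just placeholders):

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder model and data, just to make the loop runnable.
model = nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)

optimizer = Adam(model.parameters(), lr=1e-3)
# Cosine annealing over 10 epochs (roughly what cycle_len controls).
scheduler = CosineAnnealingLR(optimizer, T_max=10)

for epoch in range(10):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()   # Adam's per-parameter adaptive update
    scheduler.step()   # anneals the base lr stored in optimizer.param_groups
    print(epoch, optimizer.param_groups[0]["lr"])
```

So the scheduler seems to be rewriting the lr in optimizer.param_groups each epoch, which is what prompted the question.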
Thanks.