(Bear with me here because this might be a bit incoherent.)

I’ve been thinking a little bit about the learning rate schedule.

Right now I believe it’s been empirically determined that increasing the learning rate while decreasing momentum, then cosine annealing the learning rate while increasing momentum again, works pretty well.

It’s occurred to me that these magic heuristic functions are exactly the kind of thing we should be able to improve with machine learning. What if we could learn the optimal learning rate to use on each batch to maximize the decrease in our loss function?

The problem is that we don’t have a differentiable function relating our learning rate to our loss so we can’t use autograd. But… shouldn’t we be able to estimate the slope by “trying” a few learning rates above and below our current LR each batch (resetting back to the previous state each time) and choosing the one that minimizes the loss?
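
To make the idea concrete, here is a minimal sketch of the "try a few learning rates, reset, keep the best" loop. It is only a toy under stated assumptions: a one-parameter linear fit in plain Python with a hand-written gradient stands in for a real network, and the names `loss_and_grad`, `probe_lr`, `train` and the candidate factors `(0.5, 1.0, 2.0)` are all made up for illustration.

```python
import random

def loss_and_grad(w, batch):
    """Mean squared error of y ~ w * x on a batch, plus its gradient w.r.t. w."""
    n = len(batch)
    loss = sum((w * x - y) ** 2 for x, y in batch) / n
    grad = sum(2 * (w * x - y) * x for x, y in batch) / n
    return loss, grad

def probe_lr(w, batch, lr, factors=(0.5, 1.0, 2.0)):
    """Try lr below, at, and above the current value, always starting from
    the same weights, and keep whichever gives the lowest loss on this batch."""
    _, grad = loss_and_grad(w, batch)
    best = None
    for f in factors:  # hypothetical choice of candidates
        cand_lr = lr * f
        cand_w = w - cand_lr * grad  # trial step from the saved state
        cand_loss, _ = loss_and_grad(cand_w, batch)
        if best is None or cand_loss < best[0]:
            best = (cand_loss, cand_lr, cand_w)
    return best[1], best[2]

def train(steps=50, batch_size=8, lr=0.01, seed=0):
    """Run the probing loop on synthetic data and record the LR picked each step."""
    rng = random.Random(seed)
    # Toy data: y = 3x plus a little noise.
    data = [(x, 3.0 * x + rng.gauss(0, 0.1))
            for x in (rng.uniform(-1, 1) for _ in range(256))]
    w, history = 0.0, []
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        lr, w = probe_lr(w, batch, lr)
        history.append(lr)
    return w, history

if __name__ == "__main__":
    w, history = train()
    print("final w:", w)
    print("learning rates chosen:", history)
```

The `history` list is the interesting output here: plotted against the step number, it is the schedule the search "discovered". Note that this version is greedy, since it scores each candidate on the same batch it just stepped on, and it costs one extra loss evaluation per candidate per batch.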

It would certainly be slow… but maybe it could provide insight into what a better function for the learning rate schedule might be.

Is this a completely crazy idea?

I found this paper that had a similar idea, but if I understand correctly, they were trying to find a single constant learning rate that would be optimal, rather than an optimal schedule that changes over the course of training.