Hi again. I created some confusion with my first reply, and would like to clear it up. Hopefully without adding more confusion!

The theorem from calculus is right in theory, but not very relevant in practice. It lets us be confident that gradient descent will reach a local minimum if the step is small enough. But once momentum and fancy optimizers come into the picture, all bets are off. These estimate a step intended to decrease the loss the most. That step can be many times larger than the cautious calculus step, point away from the exact gradient direction, and sometimes be plain wrong. In fact, the loss occasionally increases after a weight update. (Note that fastai’s loss graph display is smoothed, so you don’t see many of these loss increases.)
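To make this concrete, here is a toy sketch (plain Python on a made-up one-dimensional loss, not anything fastai actually runs) of momentum with a deliberately aggressive step size. Counting the updates where the loss went up shows the overshooting in action:

```python
# Toy loss(w) = w**2 with its gradient 2w.
def loss(w):
    return w * w

def grad(w):
    return 2 * w

w, v = 1.0, 0.0          # weight and momentum buffer
lr, beta = 0.45, 0.9     # deliberately aggressive hyperparameters

losses = []
for _ in range(20):
    v = beta * v + grad(w)   # running average of gradients (momentum)
    w = w - lr * v           # the combined step can overshoot the minimum
    losses.append(loss(w))

# Count updates where the loss rose compared to the previous update.
increases = sum(1 for a, b in zip(losses, losses[1:]) if b > a)
print(increases)
```

With a tiny `lr` and no momentum the loss would decrease every step; with these settings the counter comes out greater than zero, which is exactly the “loss occasionally increases” behavior above.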

Sorry for my misleading theoretical answer. Brent @bgraysea, you are right that the update is done all at once. The only new information the optimizer has to work with at the update is the current weights and the current gradient. It may also have saved running averages of these (momentum, Adam), or the previous gradients (second-order optimizers). It then uses everything to come up with a direction and size for the “best” step. But the outcome is not guaranteed by theory.
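Here is a minimal sketch of that idea (a hand-rolled, simplified Adam-style update, not fastai’s or PyTorch’s implementation). Notice that `step` receives only the current weight and gradient; everything else the optimizer uses is state it saved from previous updates:

```python
import math

class ToyAdam:
    """Simplified scalar Adam: keeps two running averages plus a step count."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0   # running average of gradients (momentum)
        self.v = 0.0   # running average of squared gradients
        self.t = 0     # step counter, used for bias correction

    def step(self, w, g):
        """One update from the current weight w and current gradient g."""
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * g * g
        m_hat = self.m / (1 - self.beta1 ** self.t)   # bias-corrected averages
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

# Minimize the toy loss (w - 3)**2, whose gradient is 2 * (w - 3).
opt = ToyAdam(lr=0.1)
w = 0.0
for _ in range(500):
    g = 2 * (w - 3)
    w = opt.step(w, g)
print(w)
```

After enough steps `w` ends up near the minimum at 3, but nothing in the update rule itself guarantees the loss drops on any particular step.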

Here are some musings, and all that follows is just my opinion. The Universal Approximation Theorem says that a sufficiently complex model can approximate any function. But no one would ever use its construction in practice. The model it gives would have an enormous number of units, and who knows whether it would even generalize to new data. Instead we use working, practical knowledge of which architectures work well in which domains. These models fit in a GPU, train well, and generalize. Still, the UAT gives us confidence that what we are trying to achieve is possible, not futile.

Likewise, the calculus theorem tells us that gradient descent methods, in general, work. But no one would train with tiny steps in the negative gradient direction. We now have practical knowledge about optimizers that take bigger steps approximately along the gradient, and train faster overall. Still, the calculus theorem tells us that this is a reasonable approach.

Matt @machinethink, if you will forgive me for any misunderstandings, I think your scenario applies to a large step size. The loss can indeed go up, as you point out. Way “out there” from the current weights there are no guarantees about the loss value or its gradients. But if you make the step size smaller and smaller there will come a step size where the loss is guaranteed to decrease.
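That last guarantee can be demonstrated directly. Below is a toy sketch (plain Python, a made-up smooth loss; this is essentially a backtracking line search, not anything fastai does) that starts with a huge step, watches the loss go up, and keeps halving the step until the decrease the theorem promises actually appears:

```python
# A made-up smooth toy loss and its gradient.
def loss(w):
    return w ** 4 - 3 * w ** 2 + w

def grad(w):
    return 4 * w ** 3 - 6 * w + 1

w = 2.0
step = 10.0   # deliberately huge: the loss goes UP at this step size
while loss(w - step * grad(w)) >= loss(w):
    step /= 2                     # shrink until the decrease appears

w_new = w - step * grad(w)
assert loss(w_new) < loss(w)      # guaranteed once the step is small enough
print(step, loss(w), loss(w_new))
```

The loop always terminates as long as the gradient at `w` is nonzero: for a small enough step, the loss change is approximately `-step * grad(w)**2`, which is negative. That is the calculus theorem in one line.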

Thanks for reading my post and for the chance to clarify!