Second order methods for SGD convergence?

I would like to propose a discussion about second order methods for SGD convergence. While 1st order methods currently dominate, 2nd order methods (overview) seem very promising: just fitting a parabola in a single direction would allow for a smarter choice of step size, and they can simultaneously optimize in multiple directions.
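To make the parabola idea concrete, here is a minimal NumPy sketch (the name `parabola_step` and the probe size `h` are my own illustrative choices, not from any particular paper): evaluate the loss at three points along a direction, fit a parabola, and jump to its vertex.

```python
import numpy as np

def parabola_step(f, x, d, h=1e-2):
    """Fit a parabola to the loss along direction d and jump to its vertex."""
    # Loss at three points on the line x + t*d, for t = 0, h, 2h
    f0, f1, f2 = f(x), f(x + h * d), f(x + 2 * h * d)
    # From f(t) ~ a*t^2 + b*t + c: the second difference gives a, the first gives b
    a = (f2 - 2 * f1 + f0) / (2 * h ** 2)
    b = (f1 - f0) / h - a * h
    if a <= 0:                       # no positive curvature along this line: plain step
        return x + h * d
    return x + (-b / (2 * a)) * d    # step to the parabola's vertex

# Usage on a toy quadratic: the minimum is found in a single step
f = lambda x: np.sum((x - 3.0) ** 2)
x = np.zeros(5)
x = parabola_step(f, x, d=-(2 * (x - 3.0)))  # d = -gradient
print(x)  # ~ [3. 3. 3. 3. 3.]
```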

One of the issues is that Newton’s method is attracted to saddle points, and there is a belief that saddles completely dominate the landscape: that there are exp(dim) times more of them than minima. One of the few 2nd order methods that actively repels saddles is saddle-free Newton (SFN), which claims to reach much better solutions this way: https://i.stack.imgur.com/MuG7w.png
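For reference, the core trick of SFN is to replace the Hessian’s eigenvalues with their absolute values, so negative-curvature directions are escaped rather than attracted to. A toy NumPy sketch, assuming the Hessian is small enough to eigendecompose explicitly (the SFN paper itself works in a low-dimensional Krylov subspace to stay tractable):

```python
import numpy as np

def saddle_free_newton_step(grad, hess, eps=1e-4):
    """One saddle-free Newton step: use |H| instead of H, so saddle
    directions (negative curvature) are pushed away from, not toward."""
    eigvals, eigvecs = np.linalg.eigh(hess)       # H assumed symmetric
    abs_vals = np.maximum(np.abs(eigvals), eps)   # |eigenvalues|, damped away from zero
    # -|H|^{-1} g, computed in the eigenbasis
    return -eigvecs @ ((eigvecs.T @ grad) / abs_vals)
```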

Why aren’t second order methods gaining popularity? Should we go this way?
How can we improve them and resolve their weaknesses?
Is saddle attraction a big issue? Should we actively repel saddles?


Yeah, I came here to ask similar questions and found this post. What I’m wondering is why algorithms that try to use information about the second-order derivative, by approximation or exact calculation, aren’t gaining popularity. In basically every paper I read, 1st order methods are used. Is there a reason behind this, or is it just a matter of how memory inefficient they are?

Here is some discussion: https://stats.stackexchange.com/questions/394083/why-second-order-sgd-convergence-methods-are-unpopular-for-deep-learning

The memory issue can be handled by using a 2nd order model in a low-dimensional subspace.
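A rough sketch of what I mean, assuming a handful of spanning directions (e.g. recent gradients) and Hessian-vector products by finite differences of the gradient; all names here are illustrative, and negative curvature would still need handling (e.g. absolute eigenvalues, as in SFN):

```python
import numpy as np

def subspace_newton_step(grad_fn, x, directions, eps=1e-5, damping=1e-4):
    """Newton step restricted to span(directions); only a k x k Hessian is ever formed."""
    # Orthonormal basis Q (n x k) of the chosen subspace, e.g. a few recent gradients
    Q, _ = np.linalg.qr(np.stack(directions, axis=1))
    g = grad_fn(x)
    # Hessian-vector products by finite differences: H q_i ~ (grad(x + eps*q_i) - grad(x)) / eps
    HQ = np.stack([(grad_fn(x + eps * Q[:, i]) - g) / eps for i in range(Q.shape[1])], axis=1)
    H_small = 0.5 * (Q.T @ HQ + (Q.T @ HQ).T)   # k x k Hessian in the subspace, symmetrised
    g_small = Q.T @ g                           # k-dim gradient in the subspace
    # Damped Newton step inside the subspace, mapped back to the full parameter space
    step = np.linalg.solve(H_small + damping * np.eye(len(g_small)), g_small)
    return x - Q @ step
```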

Hi all. Jarek and I have already had a short discussion of this topic at
https://forums.fast.ai/t/2nd-order-learning-rate-adaptation-minimizing-parabola-in-one-step/

I am only an amateur mathematician, so much of the theoretical discussion goes over my head. But over the course of this study I’ve become, if anything, more biased toward demonstrated practical techniques. If anyone actually implemented a 2nd order optimizer, compared its performance against commonly used methods, and found it was better, I’d be the first to adopt it. But so far all I have seen are many theoretical arguments for and against.

The most commonly available 2nd order optimizer is LBFGS. I have seen it used effectively in limited contexts (where the loss landscape is roughly parabolic). But when I tried it myself on larger problems, it was no better than Adam. There is also the argument that momentum-based optimizers are already, in effect, approximating the 2nd derivative.
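For anyone who wants to try it, PyTorch’s built-in LBFGS needs a closure because it may re-evaluate the loss several times per step. A toy example (the model and data here are just placeholders):

```python
import torch

# Toy regression problem, just to show the closure-based LBFGS interface
model = torch.nn.Linear(10, 1)
X, y = torch.randn(256, 10), torch.randn(256, 1)
loss_fn = torch.nn.MSELoss()

opt = torch.optim.LBFGS(model.parameters(), lr=1.0,
                        max_iter=20, history_size=10,
                        line_search_fn='strong_wolfe')

def closure():
    # LBFGS calls this repeatedly to re-evaluate loss and gradient
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    return loss

for _ in range(10):
    opt.step(closure)
```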

Just as a datapoint, I once implemented both an SGD line search for the ideal step size and Newton’s method along each axis. I had great hopes and enthusiasm, but in actual testing they performed worse than the vanilla optimizers.
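In case it helps anyone reproduce the experiment, here is roughly the kind of per-axis Newton sweep I mean; this is a hypothetical sketch using finite differences, not my exact code, and names and thresholds are illustrative:

```python
import numpy as np

def newton_sweep_along_axes(f, x, h=1e-4, max_step=1.0):
    """One sweep of 1-D Newton steps, one coordinate at a time, using
    central finite differences for the first and second derivatives."""
    x = x.copy()
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        f_plus, f0, f_minus = f(x + e), f(x), f(x - e)
        d1 = (f_plus - f_minus) / (2 * h)            # df/dx_i
        d2 = (f_plus - 2 * f0 + f_minus) / h ** 2    # d2f/dx_i^2
        if d2 > 1e-12:                               # only trust positive curvature
            x[i] -= np.clip(d1 / d2, -max_step, max_step)
    return x
```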

Don’t get me wrong. I really want 2nd order methods to work, for their elegance and efficiency in theory. But I have yet to see one demonstrated that works well in practice.

Cheers,
Malcolm


Has anyone actually shown a second order method working on a more difficult dataset? Computationally things blow up quickly.

Usually every paper on this solves MNIST at best, and then when I deploy it on ImageNette or ImageWoof… poof, all the magic disappears 🙂