I would like to propose a discussion about second-order methods for SGD convergence. While first-order methods currently dominate, second-order methods (overview) seem very promising: even just fitting a parabola along a single direction would allow a smarter choice of step size, and they can optimize in multiple directions simultaneously.
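To make the parabola idea concrete, here is a minimal sketch (my own illustration, not from any specific paper): evaluate the loss at three points along the update direction, fit a quadratic through them, and step to its vertex when the curvature is positive. The function name and the three-point spacing `h` are assumptions for illustration.

```python
import numpy as np

def parabola_step(f, x, d, h=0.1, fallback=0.01):
    """Pick a step size along direction d by fitting a parabola
    y = a*t^2 + b*t + c through f at t = 0, h, 2h (hypothetical helper)."""
    y0, y1, y2 = f(x), f(x + h * d), f(x + 2 * h * d)
    a = (y2 - 2 * y1 + y0) / (2 * h ** 2)  # estimated curvature along d
    b = (y1 - y0) / h - a * h              # estimated slope at t = 0
    if a <= 0:
        return fallback                    # negative curvature: no parabola minimum
    return -b / (2 * a)                    # vertex of the fitted parabola

# On an exactly quadratic loss the fit is exact: for f(x) = ||x||^2
# and d = -x, the optimal step is t = 1 (jumping straight to the minimum).
x = np.array([3.0, 4.0])
t = parabola_step(lambda v: np.dot(v, v), x, -x)
```

On a true quadratic this recovers the exact minimizer in one step; on a real loss it is only a local model, so the step would still need safeguards (trust region, clipping).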
One of the issues is that Newton's method is attracted to saddle points, and there is a belief that saddles completely dominate the landscape: that there are exp(dim) times more of them than minima. One of the few second-order methods that actively repels saddles is saddle-free Newton (SFN), which claims to reach much better solutions this way: https://i.stack.imgur.com/MuG7w.png
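The core SFN idea can be sketched as follows: replace the Hessian H with |H| (absolute values of its eigenvalues), so that negative-curvature directions push the iterate away from the saddle instead of toward it. This is a minimal dense-Hessian illustration; the actual SFN paper works in a Krylov subspace to make this tractable at scale, and the eigenvalue floor `eps` here is my own stabilization choice.

```python
import numpy as np

def saddle_free_newton_step(grad, hess, eps=1e-4):
    """One saddle-free Newton step: -|H|^{-1} g, where |H| replaces each
    eigenvalue of H by its absolute value (floored at eps for stability)."""
    w, Q = np.linalg.eigh(hess)            # eigendecomposition of symmetric H
    w_abs = np.maximum(np.abs(w), eps)     # flip negative curvature to positive
    return -Q @ ((Q.T @ grad) / w_abs)

# At the saddle of f(x, y) = x^2 - y^2, plain Newton from (1, 1) jumps
# straight to the saddle (0, 0); SFN instead moves away along y.
g = np.array([2.0, -2.0])                  # gradient at (1, 1)
H = np.diag([2.0, -2.0])                   # Hessian of x^2 - y^2
step = saddle_free_newton_step(g, H)
```

Plain Newton would give the step (-1, -1), landing exactly on the saddle; the SFN step is (-1, +1), descending in x while escaping along the negative-curvature y direction.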
Why aren't second-order methods gaining popularity? Should we go this way?
How can we improve them and resolve their weaknesses?
Is saddle attraction a big issue? Should we actively repel saddles?