We can choose learning rate to minimize parabola in one step of gradient descent: (theta,g) are in line for parabola, we can find slope of this line e.g. by division of their standard deviations:
learnig rate = sqrt( var(theta) / var(g) ) gives minimum of parabola in one step of gradient descent
Its calculation requires maintaining four (exponential moving) averages: of theta, theta², g, g², e.g. adapting learning rate separately for each coordinate of SGD (more details in 5th page here).
E.g. RMSprop, Adam use sqrt(mean(g^2)) denominator - what is not sufficient to minimize parabola in one step.
Is there considered 2nd order adaptation of learning rate in literature?
Hi Jarek. Pytorch’s LBFGS optimizer is the only second order optimizer in active use that I know about. I think that searching for LBFGS and following article references would lead to most publications about second order methods. But I have never seen much research activity in this direction.
Your paper proposes an interesting method, or class of methods. (I only skimmed it.) My reservation is that the paper is theory, without experiments measuring the method’s performance relative to established optimizers.
I also was interested in these questions and even implemented a per coordinate Newton’s method as a PyTorch optimizer. I had great hopes, but it did not work well in practice. One informative failure was a toy problem whose loss made a barely sloping parabolic valley along a diagonal relative to the weight coordinates. The optimizer bounced between sides of the valley making negligible progress. Ordinary Adam and SGD worked much better. The optimizer was also strongly captured by local minima. Maybe some kind of gradient smoothing as suggested in the paper would solve this.
Irrespective of my failure with a simplistic 2nd order method, I encourage you to implement the paper’s methods and give them a test on standard benchmarks. Quantitative experimentation seems to be how new ideas are adopted and the field progresses.
Good luck! I have an ongoing interest in second order methods and would like to know what you discover.
Yes, I know L-BFGS, here are slides focused mainly on 2nd order (also many interesting papers about Hessian behavior): https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf
This is much simpler philosophy - just adapt learning rate of gradient descent, optimal one is given by linear trend of gradients - modeled position of minimum is where this trend intersects g=0.
I haven’t seen anything like this in literature (?), but seems very promising - we can get methods looking similar to RMSprop, Adam, but being really 2nd order: minimizing parabola in one step, independently for each parameter.
Indeed it needs experimentation, but I am theoretician working in Mathematica (currently mostly on new statistical modelling now) - I wanted to test it, but tensorflow was a nightmare. Maybe I will give it second chance during summer break.