"Note that L-BFGS was empirically observed to be superior to SGD in many cases, in particular in deep learning settings"from here
I just discovered Vowpal Wabbit and, while looking into its internals, I keep running into the L-BFGS optimization method.
Do any of you smart people have resources or an explanation on this for us lesser mortals?
My feeble understanding thus far is:
It is not Newton's method (no exact Hessian is computed), but a quasi-Newton method that approximates the inverse Hessian from gradient differences???
The L-BFGS algorithm, short for limited-memory BFGS, simply truncates the BFGS update to use only the last m position differences and gradient differences.
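If I have that right, the truncated update is usually implemented as the so-called two-loop recursion. Here is a minimal numpy sketch of that recursion (my own illustration, not Vowpal Wabbit's actual code), assuming we keep pairs s_k = x_{k+1} - x_k and y_k = g_{k+1} - g_k that satisfy the curvature condition s·y > 0:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: apply an approximate inverse Hessian to grad
    using only the last m pairs (s_k, y_k), newest pair last in the lists.
    Assumes every pair satisfies the curvature condition s.y > 0."""
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: walk the history from newest to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        q = q - alpha * y
        alphas.append(alpha)
    # Scale by gamma = s.y / y.y, a common choice of initial inverse Hessian.
    s_last, y_last = s_list[-1], y_list[-1]
    r = (np.dot(s_last, y_last) / np.dot(y_last, y_last)) * q
    # Second loop: walk back from oldest to newest.
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * np.dot(y, r)
        r = r + (alpha - beta) * s
    return r  # the search direction is -r

# A single 1-D pair s=2, y=4 implies curvature y/s = 2, so H^{-1} g = 0.5 * g.
d = lbfgs_direction(np.array([1.0]), [np.array([2.0])], [np.array([4.0])])
```

The point of the recursion is that the m stored vector pairs stand in for the full n-by-n inverse Hessian approximation, so memory and per-step cost stay O(mn) instead of O(n^2).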