Hmmmmmm. But the error can still go up to one, which it does for LR below ~1e-2 and above ~1e12. How is it able to stay down at the ~0.3 level over such a huge range of learning rates in between? My intuition is totally failing me.
I don’t think it’s randomly finding values which happen to give low error rates, because this graph seems pretty consistent over several different runs.
Is it that there is a huuuuuuuge "flat spot" on the loss manifold, and it takes a while for a step to randomly land outside of it? (If so, why would the manifold have such a huge flat spot?)
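One thing worth noting about the flat-spot hypothesis: on a (near-)flat region the gradient is (near-)zero, so the actual step taken, lr * grad, is tiny no matter how enormous the learning rate is. Here's a toy sketch of that effect (the piecewise loss and the plateau width are made up for illustration, not anything from the actual training setup):

```python
# Toy 1-D loss: exactly flat for |w| < 100, quadratic outside.
# On the plateau the gradient is 0, so gradient descent takes a
# zero step there regardless of the learning rate.
def grad(w):
    if abs(w) < 100:
        return 0.0  # the "flat spot"
    return 2 * (w - (100 if w > 0 else -100))

for lr in [1e-2, 1e2, 1e6, 1e10]:
    w = 1.0  # start on the plateau
    for _ in range(1000):
        w -= lr * grad(w)
    print(f"lr={lr:g}: final w={w:g}")  # w never moves: step = lr * 0
```

Under this picture, LR-insensitivity on the plateau is automatic; what it doesn't explain by itself is *why* the error would sit at ~0.3 there rather than somewhere else, or what gradient noise (absent in this deterministic sketch) does at the plateau's edges.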