Coding AvaGrad optimizer - help on a 'd' math term? (fixed)

LessW2020 · December 12, 2019, 11:43pm

Hi all,
I’m trying to code up the AvaGrad optimizer, which looks to outperform SGD by adaptively scaling the variance. However, the paper’s pseudo-code includes a d term with no explanation of what it is.

Can anyone clarify what they are referring to in the green box I’ve highlighted in the pseudo-code? Is d a known term that I simply haven’t seen, which is why they include it with no explanation? I searched the entire paper for a definition and found nothing.

*Of course I am emailing the authors, but I’m hoping @sgugger or someone similar can answer faster, as I’m ready to test except for that missing item.

Here’s the paper:

Thanks - hopefully I’ll have this tested shortly to see if we can build an even better general optimizer using the adaptive variance.

Less

LessW2020 · December 13, 2019, 12:51am

Ah, I already got an email back from one of the authors (Pedro):

d = “d is the number of parameters of the network (and the dimensionality of eta).”
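For anyone else coding this up, here is a minimal pure-Python sketch of what that definition implies. It assumes the paper normalizes eta as eta / ||eta / sqrt(d)||_2; the thread does not spell out the formula, so treat that as an assumption and check it against the paper or the official code. The helper names (`count_params`, `normalize_eta`) and the toy shapes are hypothetical. In PyTorch, the usual way to get the same count is `sum(p.numel() for p in model.parameters())`.

```python
import math

def count_params(shapes):
    """d: total number of scalar parameters across all parameter tensors."""
    return sum(math.prod(shape) for shape in shapes)

def normalize_eta(eta):
    """Scale eta by 1 / ||eta / sqrt(d)||_2, with d = len(eta).

    ASSUMPTION: this is the normalization used in the paper's pseudo-code.
    """
    d = len(eta)
    # ||eta / sqrt(d)||_2 is the root-mean-square of eta's entries.
    norm = math.sqrt(sum(e * e for e in eta) / d)
    return [e / norm for e in eta]

# Hypothetical toy network: a 2x2 weight matrix plus a bias of length 2.
shapes = [(2, 2), (2,)]
d = count_params(shapes)

# One (made-up) per-parameter step-size entry for each of the d parameters.
eta = [0.5, 1.0, 2.0, 4.0, 1.0, 0.5]
eta_hat = normalize_eta(eta)
rms_after = math.sqrt(sum(e * e for e in eta_hat) / d)
```

Under that assumption, dividing by sqrt(d) turns the L2 norm into a root-mean-square, so the normalized eta has an RMS of 1 no matter how many parameters the network has.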

They are posting code this weekend, so I’ll be testing AvaGrad soon and will hopefully post some results on ImageWoof etc.

Best regards,
Less

LessW2020 · December 13, 2019, 3:28am

I’ve got AvaGradW up and running (as best as I can tell) and it’s performing pretty well.

They should be posting official code this weekend though so I’ll move over to that once it’s posted in case I missed anything.

Best regards,
Less

LessW2020 · December 13, 2019, 5:44am

Here’s AvaGradW + MixNet on ImageWoof.
Yes, that learning rate really is 1E-1. That’s because AvaGrad is auto-adaptive.

morgan · December 18, 2019, 3:15pm

Hey Less, do you think AvaGrad will dethrone the Ranger flavours any time soon?

LessW2020 · December 20, 2019, 3:53am

Hi @morgan - I hope so!
Both AvaGrad and especially SLS are more focused on being adaptive and somewhat automated.
SLS is especially automated, as it does a line search at every epoch and can actually run some steps (without gradients) to ‘preview’ the changes and adjust the step size accordingly.
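
To make the ‘preview’ idea concrete, here is a rough sketch of a generic backtracking (Armijo) line search in pure Python. This is not SLS’s actual code and its procedure may differ; the function name `armijo_step` and the constants `c` and `shrink` are illustrative choices, and the quadratic loss is a toy example.

```python
def armijo_step(f, x, grad, step=1.0, c=0.1, shrink=0.5, max_tries=20):
    """Shrink the step until the loss drops enough along -grad (Armijo condition)."""
    fx = f(x)
    grad_sq = sum(g * g for g in grad)
    for _ in range(max_tries):
        trial = [xi - step * gi for xi, gi in zip(x, grad)]
        # 'Preview': only the loss is evaluated here, no new gradient is computed.
        if f(trial) <= fx - c * step * grad_sq:
            return trial, step
        step *= shrink
    return x, 0.0  # no acceptable step found; stay where we are

# Toy loss f(x) = sum(x_i^2), whose gradient is 2 * x.
f = lambda x: sum(xi * xi for xi in x)
x = [3.0, -4.0]
grad = [2.0 * xi for xi in x]
new_x, used_step = armijo_step(f, x, grad)
```

On this toy problem the initial step of 1.0 overshoots (the loss does not decrease), so the search halves it once and accepts 0.5.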

I’m waiting for SLS to update to support param groups and then will be able to do some broader testing with AvaGrad and SLS to provide some results and recommendations.
(Note - don’t use fit_one_cycle with AvaGrad, I can say that lol).

Best regards,
Less