Dynamic learning rates questions

In lesson 4 we talked about a bunch of different approaches to dynamic learning rates. I’m trying to make sure I understand them and have a few questions.

Looking at the implementation of Adagrad here, it seems to me that the learning rate for a parameter gets updated based on the magnitude of the change to that parameter over an epoch. Sebastian’s comments suggest something different:
“we would also like to adapt our updates to each individual parameter to perform larger or smaller updates depending on their importance… Adagrad [3] is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters”

Based on the implementation, it seems like how often a parameter gets updated matters less than the size of the error gradient at each update. A parameter at a point with a high learning rate may move further in a single update than a parameter with a low learning rate moves over several updates, so describing this in terms of update frequency seems misleading. I might be misunderstanding Sebastian’s terminology here.

Additionally, it seems like a parameter that has been adjusted by a lot will have its learning rate shrunk down the most, and a parameter that hasn’t been adjusted much will use the default learning rate. I can see how this would prevent overshooting the optimum and also allow the areas with the lowest gradient to continue to train quickly. However, in my head, if the error gradient is high, then we’re far from the optimum, so reducing the learning rate will make it take longer to reach.
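For reference, here is a minimal sketch of the standard Adagrad rule for a single scalar parameter. It is not the code from the implementation linked above, and the gradient streams and the `lr` and `eps` values are made up for illustration. It suggests one possible reading of the "frequent vs infrequent" wording: a parameter whose gradient is zero on most steps accumulates less squared gradient, so its effective learning rate stays higher.

```python
import math

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    """One Adagrad update for a single scalar parameter."""
    accum += grad ** 2                            # running sum of squared gradients
    effective_lr = lr / (math.sqrt(accum) + eps)  # shrinks as accum grows
    param -= effective_lr * grad
    return param, accum, effective_lr

def run(grads):
    """Apply Adagrad over a list of gradients, starting from param = 0."""
    param, accum, effective_lr = 0.0, 0.0, None
    for g in grads:
        param, accum, effective_lr = adagrad_step(param, g, accum)
    return param, accum, effective_lr

# Hypothetical gradient streams (assumptions, not real training data):
# "frequent" sees a nonzero gradient on every step, "infrequent" only on
# every 10th step, as a rarely active sparse feature might.
frequent_grads = [1.0] * 20
infrequent_grads = [1.0 if i % 10 == 0 else 0.0 for i in range(20)]

freq_param, freq_accum, freq_lr = run(frequent_grads)
rare_param, rare_accum, rare_lr = run(infrequent_grads)

print("accumulated squared gradient:", freq_accum, rare_accum)
print("effective learning rate:     ", freq_lr, rare_lr)
```

Both streams use gradients of the same size when they are nonzero, so the difference in effective learning rate here comes only from how often each parameter receives a gradient.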

I’m reading over the paper - do you know of any simple explanations of the technique?

Finally, are there techniques that try to search for a global minimum, or at least find out how local a local minimum is and pick the best of any others that get found?

p.s. I’ve put my notes up on the wiki here, feel free to add/remove/edit/fix!

I implemented both of those in the graddesc.xlsm spreadsheet - if you look at the column that calculates the new values of the parameters, you can use Excel’s formula auditing tools to see how it’s put together. Let me know if I can help with any explanations.

As I mentioned in the video, I don’t think Eve is a great approach - I’m not sure it’s worth spending time on, personally.

Also, I’d suggest focusing on RMSProp rather than Adagrad or Adadelta, since it’s more widely used, simpler to understand, more resilient, and is what’s used in Adam.
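To make the comparison concrete, here is a rough sketch of how the two scaling terms behave under a constant gradient. It uses the textbook update rules rather than anything taken from the spreadsheet, and `LR`, `DECAY`, `EPS`, and the gradient list are assumed values chosen for illustration. Adagrad sums squared gradients, so its effective learning rate keeps shrinking; RMSProp uses an exponentially decaying average, so its effective learning rate levels off.

```python
import math

LR = 0.1      # base learning rate (arbitrary value for illustration)
DECAY = 0.9   # RMSProp decay rate (assumed; a commonly used setting)
EPS = 1e-8    # small constant to avoid division by zero

def adagrad_effective_lr(grads):
    """Effective learning rate after Adagrad has seen all of grads."""
    accum = 0.0
    for g in grads:
        accum += g ** 2                  # sum grows without bound
    return LR / (math.sqrt(accum) + EPS)

def rmsprop_effective_lr(grads):
    """Effective learning rate after RMSProp has seen all of grads."""
    avg_sq = 0.0
    for g in grads:
        # decaying average forgets old gradients
        avg_sq = DECAY * avg_sq + (1 - DECAY) * g ** 2
    return LR / (math.sqrt(avg_sq) + EPS)

grads = [1.0] * 100   # hypothetical constant gradient stream

ada_lr = adagrad_effective_lr(grads)
rms_lr = rmsprop_effective_lr(grads)
print("Adagrad effective lr:", ada_lr)
print("RMSProp effective lr:", rms_lr)
```

With these assumed values, Adagrad's effective learning rate ends up around 0.01 and keeps falling as more steps are added, while RMSProp's stays close to 0.1.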

Did you see this one?

Learning to learn by gradient descent by gradient descent

Yes, but I haven’t seen any state of the art results with it, so it’s not something I’ve studied carefully as yet.
