Why one iteration at lr=0.001?

Jeremy recommends that when training a model, we first do one epoch at a low learning rate (e.g. 0.001), and then move to a high value (e.g. 0.1) and gradually decrease.
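As a toy sketch (not Jeremy's exact recipe; the rates here are just the example values from this post), the schedule could look something like this in plain Python:

```python
def lr_at_epoch(epoch, warmup_lr=0.001, peak_lr=0.1, decay=0.5):
    """One warmup epoch at a low rate, then a high rate decayed each epoch.

    warmup_lr, peak_lr and decay are illustrative values, not prescribed ones.
    """
    if epoch == 0:
        return warmup_lr              # the single low-lr "get things started" epoch
    return peak_lr * decay ** (epoch - 1)  # high rate, gradually decreased

# epoch 0 -> 0.001, epoch 1 -> 0.1, epoch 2 -> 0.05, epoch 3 -> 0.025, ...
```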

What is the purpose of the first epoch at a low learning rate? Jeremy mentions that it “gets things started”, but I don’t see an intuitive interpretation for that.

That’s discussed here in Lesson 4 (26m5s) and at later points in the video.

Thanks for the link, but I don’t see any explanation for why there should be a single epoch at low rate before moving to a high rate and then gradually decreasing. In the segment that you link to, Jeremy explains what happens if the rate is too high, but not why a first epoch would matter.

He says you want a low rate first to avoid jumping back and forth across the curve.
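As a toy illustration of that (my own example, not from the lecture): gradient descent on f(x) = x², whose gradient is 2x. With a small step size the iterate shrinks toward the minimum; with too large a step size each update overshoots and the iterate bounces to the other side of the curve with growing magnitude.

```python
def descend(lr, x=1.0, steps=20):
    """Run plain gradient descent on f(x) = x^2 starting from x."""
    for _ in range(steps):
        x -= lr * 2 * x   # gradient of x^2 is 2x
    return x

small = descend(0.01)  # each step multiplies x by 0.98: converges toward 0
big = descend(1.5)     # each step multiplies x by -2: jumps across and diverges
```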

It might be worth watching that whole lesson, if you haven’t already, to get more intuition about choosing an optimizer and its parameters. I certainly couldn’t explain it better than Jeremy; I’m only at the early stages of these lessons, and I only answered because your question had zero responses and I thought I could maybe point you in the right direction.

As far as I can see, the jumping back and forth thing applies to the learning rate in general, not to the first iteration in particular. If lr=0.1 is too high for your problem, then first doing an epoch at lr=0.001 before moving to 0.1 isn’t going to help, right? At least I don’t understand why it would, and I can’t seem to find any passage in the lecture that explains it.

My understanding of it is, and I could be wrong since I’m new enough to this field, that when you first start training a model there is a possibility of getting stuck at a saddle point that is far above the global minimum of the loss surface (maybe the gradients are noisy at the start since the weights have just been initialised - not sure).

Anyway, when that happens, the loss jumps about but doesn’t go down. So you might think “damn, my model is rubbish” and go back to adjusting the architecture or data. However, that could be counterproductive, since you might already have a decent model for the problem you are trying to solve.

So by lowering the learning rate significantly at the start of training you increase the likelihood that the model can be teased out of those early saddle points. Once it’s out of there, and the loss is consistently decreasing, you can increase the learning rate and descend the gradient as normal.

I’m totally open to correction here, but that’s how I understand it anyway.


Interesting, did you get this interpretation from watching the lectures, or did you read it somewhere else? I did not find much when googling “early saddle points”, but I came across this blog post on saddle points which I found interesting: http://www.offconvex.org/2016/03/22/saddlepoints/

Reading through the mid-2016 paper Systematic evaluation of CNN advances on the ImageNet (discussed in Lesson 11), I see that they tested a few learning rate policies in Section 3.3; however, all of the policies shown are monotonically decreasing, with no mention of a first epoch at a low rate.