How is the learning rate chosen?

In the videos and in the docs, Jeremy says to pick the learning rate from the range where the loss graph has the steepest slope. Why is this? Shouldn’t the learning rate be fixed at the point where the loss is the lowest? That would find the local minimum the fastest and most efficiently, no? Could anyone clarify this for me?


The plot is made by training the model while increasing the learning rate at each step. The horizontal axis can thus be regarded not only as the learning rate, but also time, as the model trains and improves. You are looking for a learning rate that can decrease the loss quickly, thus you look for where the slope is the steepest.

If you look at the lowest loss value, and look left and right, you see that the loss wasn’t improving that much at that point in the training. That means that those learning rates aren’t helpful anymore for reducing the loss.
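To make this concrete, here is a minimal sketch of that kind of range test on a toy problem. It is not the library's actual finder; the function names `lr_range_test` and `steepest_lr`, the toy loss `w**2`, and the start/end learning rates are all assumptions made up for illustration.

```python
import math

def lr_range_test(grad_fn, loss_fn, w, start_lr=1e-4, end_lr=10.0, num_steps=100):
    """Take one gradient step per learning rate, growing the rate exponentially."""
    lrs, losses = [], []
    for i in range(num_steps):
        lr = start_lr * (end_lr / start_lr) ** (i / (num_steps - 1))
        lrs.append(lr)
        losses.append(loss_fn(w))   # loss measured before stepping with this lr
        w = w - lr * grad_fn(w)     # the model keeps training while lr grows
    return lrs, losses

def steepest_lr(lrs, losses):
    """Return the lr whose step gave the largest loss drop per unit of log(lr)."""
    slopes = [
        (losses[i + 1] - losses[i]) / (math.log(lrs[i + 1]) - math.log(lrs[i]))
        for i in range(len(lrs) - 1)
    ]
    return lrs[slopes.index(min(slopes))]

# Toy problem (hypothetical): minimise loss(w) = w**2, whose gradient is 2*w.
lrs, losses = lr_range_test(lambda w: 2 * w, lambda w: w * w, w=5.0)
suggested = steepest_lr(lrs, losses)
lr_at_min_loss = lrs[losses.index(min(losses))]
print(f"steepest slope at lr={suggested:.4f}, lowest loss at lr={lr_at_min_loss:.4f}")
```

On this toy problem the steepest-slope learning rate comes out well below the one at the lowest loss. The lowest loss is reached only because of all the earlier steps, and the learning rate at that point is already large enough that the very next step makes the loss go up.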

I hope this was helpful! I was confused at the beginning as well.

Hi cereal_bird, hope you are having a marvelous day!

I always find that looking at two or three explanations helps me understand things more easily.
Below are some links that cover learning rate!

Cheers mrfabulous1 :smiley::smiley:


Thanks for the answer. To clarify, the learning rate is always changing?



The specifics of how the finder function is implemented are unknown to me, but I would say yes, at every step (weight update), they change the learning rate.

That’s where my confusion was. I thought that the learning rate was a fixed number. Thanks.

Glad that I could help!

So Jeremy sometimes passes in a single number as the learning rate, and that may be where the confusion is. That single number is the maximum learning rate for the learner object, so the learning rate is still changing throughout training.

It starts off small, increases to the max LR, then decreases again. So it’s like this shape: /\
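As a rough sketch of that /\ shape, assuming a purely linear ramp up and down with the peak at the midpoint: the function name `triangular_lr` and the `min_lr` parameter are hypothetical, and the library's real schedule may differ in both shape and defaults.

```python
def triangular_lr(step, total_steps, max_lr, min_lr):
    """Linear ramp from min_lr up to max_lr at the midpoint, then back down.

    Assumes total_steps >= 2 and 0 <= step < total_steps.
    """
    half = (total_steps - 1) / 2
    # 0.0 at the first and last step, 1.0 at the midpoint
    frac = 1 - abs(step - half) / half
    return min_lr + (max_lr - min_lr) * frac

# The single number you pass in plays the role of max_lr here.
schedule = [triangular_lr(s, 11, max_lr=1e-2, min_lr=1e-3) for s in range(11)]
for step, lr in enumerate(schedule):
    print(step, round(lr, 5))
```

The learning rate starts at `min_lr`, peaks at `max_lr` halfway through, and returns to `min_lr` on the last step.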


Is increasing then decreasing LR proven to have an advantage? How much does it help?

Increasing the learning rate to a specified maximum and then decreasing it tends to help the model generalize better: the large learning rates in the middle of training make it less likely to get stuck in a poor local minimum of the loss function, so it has a better chance of ending up in a good one (ideally near the global minimum).

The increasing part of the learning rate cycle helps the model jump out of local minima on the loss surface. After that, the decreasing part lets you update the weights slowly once the model is in the region of a good minimum.