Chapter 10 NLP discriminative learning rates

Hi,

Just a quick question (I think!)

In chapter 10 NLP, page 349 of the book, there’s the line:

learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

I am trying to understand why the number 2.6**4 was chosen over, say, 46, or 2.2e-4 (which is the result of calculating 1e-2/(2.6**4)).

My guess is that writing it this way makes changes to the scaling more intuitive, if our changes have an inverse power-law effect.

Does anybody know the reason?

Thanks, Mike.

The OneCycle learning rate scheduler anneals the learning rate from a low value up to a maximum and back down. Here the range is set using Python's built-in slice, which fastai interprets as discriminative learning rates across the layer groups, although you can also pass a single value like so:

learn.fit_one_cycle(1, 1e-2)

in which case every layer group gets the same maximum of 1e-2 (the schedule itself still warms up from a much smaller value). Passing slice(1e-2/(2.6**4), 1e-2) instead gives the earliest layer group a maximum learning rate 2.6**4 times smaller than the final group's, with the groups in between spread across that range.
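
Here's a minimal sketch of what the slice does to the per-group learning rates, assuming a five-way layer-group split like the AWD-LSTM's (this isn't fastai's exact code, though its even_mults helper does essentially the same thing):

import numpy as np

lo, hi = 1e-2 / (2.6**4), 1e-2
n_groups = 5  # assumption: the model is split into five layer groups

# slice(lo, hi) -> first group trains at lo, last group at hi,
# and the groups in between are spaced geometrically
lrs = np.geomspace(lo, hi, n_groups)
print(lrs)                 # ~[2.19e-04, 5.69e-04, 1.48e-03, 3.85e-03, 1.00e-02]
print(lrs[1:] / lrs[:-1])  # each entry is ~2.6x the previous one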

Thanks! So I understand that the 1e-2 in the expression is left there to make it clear that we're going from a fraction of the max LR up to the max LR.

But why is the denominator written as '2.6 to the power of 4' rather than just 46?

Perhaps it was a typo and they meant to put 2.6e-4, but it just worked anyway.

It's not a typo. With slice(lo, hi), fastai spreads the learning rates geometrically across the layer groups, and the AWD-LSTM model here is split into five groups, so there are four ratios between neighbouring groups. Writing the denominator as 2.6**4 makes that per-group factor explicit: each layer group trains with a learning rate 2.6 times higher than the group below it. Writing 46 would give (almost) the same number, but would hide the structure it encodes.

As for the 2.6 itself, it was found empirically. Jeremy has said (in the course lectures, as I recall) that it came out of a large batch of hyperparameter tuning experiments on ULMFiT, so it's a tuned constant that works well for this family of models rather than something derived from theory.
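
To make that concrete, a quick check in plain Python (nothing fastai-specific; the names are just for illustration):

factor, n_steps = 2.6, 4   # five layer groups -> four ratios between neighbours

print(factor ** n_steps)         # 45.6976, i.e. roughly the 46 suggested above
print(1e-2 / factor ** n_steps)  # ~2.19e-4, the lr of the lowest layer group

Written as 46 the call would behave almost identically, but 2.6**4 documents the per-group ratio that the number encodes.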