So at the end of the pascal notebook we have this: learn.fit(lr, 1, cycle_len=3, use_clr=(32,5))
Looking at the code in learner.py, I found that when use_clr is specified we end up using a CircularLR schedule instead of CosAnneal. But even after looking at the CircularLR code, I'm having trouble working out what the parameters clr_div and cut_div mean, and why it's better to use a circular anneal instead of a cosine one in this case.
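For context, the scheduler selection I'm describing looks roughly like this (a paraphrase from memory of learner.py, not the exact source):

```python
# Rough paraphrase of the scheduler dispatch in fastai's learner.py (not verbatim):
if use_clr is not None:
    clr_div, cut_div = use_clr  # the two numbers in the use_clr tuple
    self.sched = CircularLR(layer_opt, n_iters, div=clr_div, cut_div=cut_div)
else:
    self.sched = CosAnneal(layer_opt, n_iters, cycle_mult=cycle_mult)
```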
I think this is linked to the cyclical learning rates introduced here. I played around a bit, and with 2 cycles and the parameters (20,5) I get this plot of how the learning rate evolves over the cycles (you can see it after a fit with learn.sched.plot_lr()).
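In case anyone wants to reproduce it, this is roughly what I ran (fastai 0.7-style API; the cycle_len value is just what I happened to use, and lr and the data are whatever you were already training with):

```python
# Reproducing the plot: two cycles with (div, cut_div) = (20, 5).
# Assumes an existing fastai Learner called `learn` and a chosen `lr`.
learn.fit(lr, 2, cycle_len=1, use_clr=(20, 5))
learn.sched.plot_lr()  # plot how the learning rate evolved during training
```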
I think the first one, div, is how much your learning rate is divided by to get the minimum value (the maximum value is the lr you pass as a parameter). The second argument seems to be how you split your cycle between increasing and decreasing: 1/5 of growth and 4/5 of descent.
I'm having some issues understanding the second argument. I understand the first one as being the number of iterations it takes to go through the full cycle — is that how you understand it? The second one seems to be how fast it rises to the top and comes back down, but I don't understand how you know it's 1/5 and 4/5. Is it always that way? If I had a 10 in there, would it be 1/10 of growth and 9/10 of descent?
The first one isn't a number of iterations; it's the scale between your minimum and your maximum. With 20, for instance, you begin at 0.0005 if your learning rate is 0.01 (as in the graph I showed).
Then the second one works exactly as you said: with 10 it would be 1/10 of growth and 9/10 of descent. A typical use would be to pair this with a cycle_len of 10: during the first 1/10 of the cycle (one epoch) the learning rate grows from the minimum value (lr divided by the first argument) to your learning rate, then for the remaining 9/10 of the cycle (nine epochs) it slowly goes back down to the minimum.
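If it helps, here's a minimal sketch of that triangular shape in plain numpy — my own reconstruction of the behaviour described above, not the actual CircularLR source:

```python
import numpy as np

def circular_lr(max_lr, n_iters, div=20, cut_div=5):
    """Sketch of the one-cycle triangular schedule described above:
    rise from max_lr/div to max_lr over the first 1/cut_div of the
    iterations, then descend back to the minimum over the rest."""
    cut = n_iters // cut_div                      # end of the growth phase
    lrs = np.empty(n_iters)
    for it in range(n_iters):
        if it < cut:                              # growth: 1/cut_div of the cycle
            pct = it / cut
        else:                                     # descent: the remaining (cut_div-1)/cut_div
            pct = 1 - (it - cut) / (n_iters - cut)
        lrs[it] = max_lr * (1 + pct * (div - 1)) / div
    return lrs

lrs = circular_lr(max_lr=0.01, n_iters=1000)
print(lrs.min(), lrs.max())  # ~0.0005 and 0.01: the minimum is lr / div
```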
Why is it designed to be used with only one cycle? Do you mean the second argument to fit (the number of cycles) should always be 1 when use_clr is specified, e.g. learner.fit(lr, 1, ..., use_clr=(20, 5))? It seems to me it works perfectly fine even when that argument is some value other than 1. Thanks in advance!
Leslie Smith’s new paper is the best place to read about the reasoning. The code will work fine with more cycles, but the paper recommends using just one.
His reasoning in the paper is a bit confusing. First he says:
Furthermore, there is a certain elegance to the rhythm of these cycles and it simplifies the decision of when to drop learning rates and when to stop the current training run. Experiments show that replacing each step of a constant learning rate with at least 3 cycles trains the network weights most of the way and running for 4 or more cycles will achieve even better performance.
And then he says that it is actually best to stop after the first cycle:
Also, it is best to stop training at the end of a cycle, which is when the learning rate is at the minimum value and the accuracy peaks
But when we look at figure 1, we see the accuracy peaks at the end of each cycle, and it actually gets slightly better on cycles 2 and 3.
I'm using cyclical LR with the parameters use_clr=(30,5). It's clearly working better for me than normal cosine annealing. I want to understand whether there are guidelines for choosing the two numbers, or is it purely experimental? If people who have tried it can share their observations, that would be really helpful. Thanks in advance!
Generally something around (10,10) seems to work pretty well. If you've got more epochs or a larger dataset, try changing the first number to 40. For small datasets, try changing the second number to 5.
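In fit-call form, those guidelines would look something like this (same fastai 0.7-style API as above; the cycle_len values are just for illustration):

```python
learn.fit(lr, 1, cycle_len=10, use_clr=(10, 10))  # reasonable default
learn.fit(lr, 1, cycle_len=20, use_clr=(40, 10))  # more epochs / larger dataset: deeper minimum
learn.fit(lr, 1, cycle_len=10, use_clr=(10, 5))   # small dataset: growth takes 1/5 of the cycle
```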