Cyclical Learning Rate fastai implementation Clarifications

DrC · June 17, 2019, 6:07pm

I am looking for some clarifications regarding CLR implementation if anybody can help.

1. The code in “How Do You Find A Good Learning Rate” by @sgugger makes sense, but the fastai implementation uses num_it=100, which confuses me. Is the number of iterations the same as the number of batches? Or is it always a constant that may cover more or fewer images than the batches in one epoch?
2. When recording the loss values, do we record them only for the first cycle? Or do we aggregate the losses of different cycles? If it is only the first cycle, when does the cycling happen? I guess we run it over many cycles, but how are they all combined?
3. Would you please elaborate on the epochs variable in the lr_find implementation? How is it related to actual epochs? This variable is the number of iterations divided by the number of training batches, rounded up.

epochs = int(np.ceil(num_it/len(learn.data.train_dl)))

Thanks in advance and apologies if the questions are trivial.

xeTaiz · June 18, 2019, 7:43am

1. num_it determines the number of different learning rates to be tested. It basically determines how many points you get to use in your learn.recorder.plot().
2. I’m a bit confused by this one. Is this question related to the LR finder?
   With cyclical learning rates you just vary your learning rate over the course of training. You’re getting a new loss every iteration here and use it to update your model; the updating just happens at different speeds. Usually you then use the weights with which the model performed best, whether that happens in the first or the 5th cycle. I think I have seen approaches that try to combine the weights obtained at all the “learning rate lows” (where your model should have converged nicely after the annealing part). However, as far as I know, fast.ai does not do that. It just schedules your learning rate.
3. len(learn.data.train_dl) is basically the number of batches per epoch, and num_it is the number of different learning rates you want to test. If num_it is smaller than the number of batches, the fraction is <1 and is thus rounded up to 1. If num_it is larger than the number of batches in your DataLoader, you need more than one epoch to test each of the num_it learning rates. So this line always gives you at least as many epochs as you need to have enough batches to test all LRs.
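
To make point 3 concrete, here is a minimal sketch of that rounding in plain Python. The helper name lr_find_epochs and the batch counts are made up for illustration; only the ceil(num_it / number_of_batches) formula comes from the line quoted above.

```python
import math


def lr_find_epochs(num_it: int, batches_per_epoch: int) -> int:
    """Smallest number of epochs that supplies at least num_it batches.

    Hypothetical helper mirroring: int(np.ceil(num_it / len(train_dl))).
    """
    if num_it < 1 or batches_per_epoch < 1:
        raise ValueError("num_it and batches_per_epoch must be positive")
    return math.ceil(num_it / batches_per_epoch)


# Fewer iterations than batches: a single (partial) epoch is enough.
print(lr_find_epochs(num_it=100, batches_per_epoch=250))  # 1

# More iterations than batches: several passes over the data are needed.
print(lr_find_epochs(num_it=100, batches_per_epoch=30))   # 4
```

So the epochs value is only an upper bound on how many passes are needed; the finder can stop as soon as num_it batches have been consumed.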

Hope I could help. Greetings

DrC · June 18, 2019, 3:51pm

Thanks Dominik. Yes, that certainly clears up some concerns… so from your (3), the number of iterations can be different from the number of batches. That’s what I thought, but in many places I’ve found iteration and mini-batch used interchangeably, which was confusing.

To follow up:

Yes, this is related to the LR finder. When we plot LR, say from 1e-7 to when the loss shoots up, this is taken only from the last cycle, correct? Oh, this has nothing to do with CLR?

I think my confusion comes from combining the LR finder with the CLR scheduler. Does CLR affect our LR finder, or is it solely part of the optimization? Maybe I should implement it to better understand it.

I’m guessing that from the LR finder we get the upper bound of our CLR, correct? Now, how about the decrease that is repeated every half cycle (step size)? Also, will earlier layers be bounded by a different learning rate in their optimization?
Anyway, I am sure you can see my confusion… any clarifications would be helpful.

Thanks again. Cheers

xeTaiz · June 19, 2019, 8:55am

Let me clarify this iteration vs. batch thing: in the case of the LR finder, you necessarily fix the number of iterations beforehand, because that is the number of different learning rates you want to try. This number might differ from the number of batches in one epoch. Since you fix your batch size and have a fixed number of elements in your dataset, you end up with a fixed number of batches per epoch. With the LR finder you then just grab as many batches as you need iterations (if there are not enough batches in 1 epoch, the number of epochs is adjusted). When actually training a model, a batch / mini-batch is usually equivalent to an iteration, unless you decrease your batch size for memory reasons but still want a better gradient estimate; then you might accumulate gradients over multiple batches per iteration. Ultimately, iteration = weight update.
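
As a rough illustration of how the iterations map onto batches and epochs, here is a small sketch. It assumes the learning rate grows geometrically from start_lr to end_lr over num_it steps, which is my understanding of the fastai default; the function name lr_sweep and the batch count of 30 are made up for the example.

```python
from typing import Iterator, Tuple


def lr_sweep(num_it: int, batches_per_epoch: int,
             start_lr: float = 1e-7, end_lr: float = 10.0
             ) -> Iterator[Tuple[int, int, int, float]]:
    """Yield (iteration, epoch, batch_index, lr) for a hypothetical LR sweep.

    Assumption: the LR is spaced geometrically between start_lr and end_lr.
    """
    if num_it < 2:
        raise ValueError("num_it must be at least 2")
    ratio = end_lr / start_lr
    for it in range(num_it):
        # One iteration consumes one batch; wrap into a new epoch when
        # the current epoch runs out of batches.
        epoch, batch_idx = divmod(it, batches_per_epoch)
        lr = start_lr * ratio ** (it / (num_it - 1))
        yield it, epoch, batch_idx, lr


steps = list(lr_sweep(num_it=100, batches_per_epoch=30))
first, last = steps[0], steps[-1]
print(first[:3], last[:3])  # (0, 0, 0) (99, 3, 9)
```

With 30 batches per epoch, the 100th iteration lands on the 10th batch of the 4th epoch, so no cycling is involved: the data is just re-used until enough learning rates have been tried.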

On your follow up:
I think you’re mixing up the LR finder and CLR. While the same guy, Leslie Smith, came up with both things (I think even in the same paper), there is no cycling involved in the LR finder.
I’ll try to extract the concepts you describe out of your post and clarify.

The LR finder grabs a fixed number (num_it) of batches from your data, puts them through the model one at a time, computes gradients and updates the weights using the LR scheduled for that iteration. A loss is recorded at every iteration and plotted against the LR (tbh, I am not sure which loss is recorded here exactly; as far as I can tell from the fastai callback it is the smoothed training loss on the mini-batches, not a validation loss).
The CLR schedule happens when you are actually training. The LR from the finder is used as maximum lr in this schedule. Now this schedule only scales your learning rate (maybe your momentum as well, at least the newer stuff like one_cycle). That means at different times in training you update your weights with different speeds. Jeremy explains nicely how the increasing part helps to jump out of undesired narrow local minima and the decreasing part helps anneal to the minimum point without jumping past it.
You also bring up discriminative learning rates, which means that different layers may be trained with different learning rates. This idea is mostly useful with transfer learning. Imagine you take an arbitrary conv net, pretrained on ImageNet, and want to fine-tune it for your very own classifier or so. That means you take the whole pretrained conv net, add a fully-connected layer or two on top and start training.

Now your FC layer is randomly initialized and produces garbage, while the conv net already produces somewhat meaningful features on images. Thus you start by training only the FC layer, until it is trained to a point where the somewhat-good features from the CNN are fully utilized by the FC layer to output a decent classification. This is the whole freezing/unfreezing part.

Now consider that different parts of the network might need to be adjusted at different rates. For example, the very first conv layer most probably looks like a Gabor filter bank, basically extracting contrasts in all directions. There is probably no need to change a lot in this layer. However, the last few conv layers probably encode very high-level features that are rather specific to ImageNet. Those layers might need more training. The solution is thus to use a rather low LR for the early layers (maybe around 1/100th or so) and a higher LR for the later layers. Starting with the conv net frozen is kind of an extreme setting of this, since the LR for all layers but the ones you attach is set to 0.
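
To illustrate the discriminative learning rate idea, here is a minimal sketch in plain Python. It assumes the rates are spaced geometrically between a low value for the earliest layer group and a high value for the last one; the helper names spread_lrs and frozen_lrs and the choice of three layer groups are hypothetical, not fastai API.

```python
from typing import List


def spread_lrs(low_lr: float, high_lr: float, n_groups: int) -> List[float]:
    """Geometrically spaced LRs: lowest for the earliest layer group."""
    if n_groups < 1:
        raise ValueError("n_groups must be positive")
    if n_groups == 1:
        return [high_lr]
    step = (high_lr / low_lr) ** (1 / (n_groups - 1))
    return [low_lr * step ** i for i in range(n_groups)]


def frozen_lrs(head_lr: float, n_groups: int) -> List[float]:
    """Extreme case: only the newly attached head (last group) is trained."""
    if n_groups < 1:
        raise ValueError("n_groups must be positive")
    return [0.0] * (n_groups - 1) + [head_lr]


# Early layers get 1/100th of the LR used for the last group.
lrs = spread_lrs(low_lr=1e-5, high_lr=1e-3, n_groups=3)
print(lrs)

# Frozen backbone: LR of 0 everywhere except the head.
print(frozen_lrs(head_lr=1e-3, n_groups=3))
```

Each value would then be handed to the optimizer as the learning rate of the matching layer group.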

… that came out longer than expected. Cheers

DrC · June 19, 2019, 7:15pm

Thanks Dominik, I appreciate your help and time. I understand the LR finder and fine-tuning well; my issue is with the CLR scheduling, but it is becoming clearer after reviewing the One Cycle Policy. I was looking solely into CLR before. Thanks.
My conclusion is that I still want to implement it to better understand it.
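
For anyone else who wants to try implementing it, here is a minimal sketch of the triangular policy as I understand it from the CLR paper: the learning rate climbs linearly from base_lr to max_lr over step_size iterations, then falls back over the next step_size iterations, and repeats. The function name and the numbers are my own choices, and this is not the fastai one-cycle schedule, which runs a single cycle with a different shape.

```python
def triangular_lr(iteration: int, step_size: int,
                  base_lr: float, max_lr: float) -> float:
    """Learning rate at a given iteration under a triangular cyclical policy.

    One full cycle lasts 2 * step_size iterations (up, then down).
    """
    if step_size < 1:
        raise ValueError("step_size must be positive")
    cycle_len = 2 * step_size
    pos = iteration % cycle_len           # position inside the current cycle
    if pos <= step_size:
        frac = pos / step_size            # rising half: 0 -> 1
    else:
        frac = 2 - pos / step_size        # falling half: 1 -> 0
    return base_lr + (max_lr - base_lr) * frac


# Two full cycles with step_size=4: the LR peaks at iterations 4 and 12.
schedule = [triangular_lr(i, step_size=4, base_lr=0.001, max_lr=0.01)
            for i in range(17)]
print([round(lr, 5) for lr in schedule])
```

The max_lr here is the value you would read off the LR finder plot; base_lr is usually chosen some factor below it.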

For anybody who comes across related concerns, I would refer to the following articles/code/papers that have personally helped me understand the concept:

fastai doc of One Cycle Policy:
The 1cycle policy

Blogs:

Code Implementation:

and Leslie Smith Papers:

Cheers