Yes, I’ll correct that

And here is my attempt to make practitioners aware of Cyclic Learning Rate

Blogged about it here:

The Cyclic Learning Rate technique

–

Nice job @anandsaha . Although I think you’re confusing some issues:

- The learning rate finder happens to be mentioned in the CLR paper, but has nothing to do with CLR otherwise
- SGDR is not an optimizer, it’s a different annealing schedule
- CLR is an annealing schedule that’s not used at all in the fastai library.

So I’d suggest covering SGDR, not CLR, since SGDR solves the same problem but better

Thanks @jeremy for that feedback!

The learning rate finder happens to be mentioned in the CLR paper, but has nothing to do with CLR otherwise

I mentioned the upper and lower rate finder as described in the paper; was that not its original contribution?

SGDR is not an optimizer, it’s a different annealing schedule

Aah… correcting now.

CLR is an annealing schedule that’s not used at all in the fastai library.

Yes, I mentioned that ‘The fastai library uses CLR to find an optimal LR and SGDR as the optimizer.’ Now that I see your post, you mentioned fastai used the *idea* from CLR. I will correct it. Would that be an appropriate statement?

So I’d suggest covering SGDR, not CLR, since SGDR solves the same problem but better

Absolutely! I am going through that paper

Btw I also discovered that fastai student Brad Kenstler implemented the CLR scheduler in Keras, which got ported to PyTorch and is awaiting review to merge to master. Amazing!

No, the main contribution was the idea of continuously moving the LR both down *and* up. Previously people had generally only *decreased* LR.

The idea of the “LR finder” was an additional contribution, but is largely orthogonal.

So I used **an** idea from the CLR paper (the idea of the LR finder), not **the** idea.
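To make the LR finder idea concrete, here is a minimal sketch of the schedule it sweeps through. The function name and defaults are my own illustration, not fastai's API: the idea is just to increase the LR exponentially over a short run, train one mini-batch per step while recording the loss, and pick an LR somewhat below the point where the loss starts to blow up.

```python
def lr_finder_schedule(start_lr=1e-5, end_lr=10.0, num_steps=100):
    """Exponentially increase the LR from start_lr to end_lr over num_steps.

    You would train one mini-batch at each LR, record the loss, and choose
    an LR a bit below where the loss curve starts to diverge.
    """
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

lrs = lr_finder_schedule()
```

The exponential spacing is the point: it covers several orders of magnitude in one short sweep, which a linear ramp could not do.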

The SGDR paper shows very impressive SoTA results, especially along with snapshot ensembling.

Hope that clarifies a bit…

Got it, thanks for the insight!

I found that his implementation of cycling is different from fastai’s PyTorch version. He updates rates in a linear fashion (increase and decrease), whereas fastai’s decreases with cosine annealing and immediately restarts back to the upper learning rate. See the difference in the LR plots:

Fast AI Pytorch

bckenstler’s Method
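The two shapes can be sketched as pure schedule functions, assuming the formulas from the respective papers (the parameter names here are mine): CLR's triangular policy ramps the LR linearly up and then down within each cycle, while SGDR cosine-anneals from the upper LR down to the lower one and then jumps straight back up at the restart.

```python
import math

def triangular_clr(step, step_size, base_lr, max_lr):
    """CLR triangular policy: linear ramp up then down, one full
    triangle every 2 * step_size steps."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

def sgdr(step, cycle_len, min_lr, max_lr):
    """SGDR: cosine-anneal from max_lr to min_lr within each cycle,
    then restart immediately back at max_lr."""
    t = (step % cycle_len) / cycle_len
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))
```

Plotting either function over a few thousand steps reproduces the saw-tooth vs. triangle shapes in the plots above.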

Yes, the one he implemented (and I wrote about) is from this paper.

The ones in fastai are from these papers:

https://arxiv.org/abs/1608.03983

https://arxiv.org/abs/1704.00109

(also @jeremy’s reply has some more info)

–

Great article @apil.tamang !

Just one comment: in my understanding, you must use `cycle_save_name` in `learn.fit()` if you want to save weights after each `cycle_len` and get the average of the weights at the end.

Is that the way it works though? Does `cycle_save_name` give you the average of the weights? Or does it save the one that has the minimum validation loss (i.e. like the Keras model checkpoint callback)? cc @jeremy

It saves after every cycle. It’s up to you to load them up and average them. See `planet_cv.ipynb` for an example.

So is this similar to the effect of snapshot ensembling then, if you were to use `cycle_save_name` and then took the average of the preds generated from each of those saved weights from each cycle, with the idea that perhaps each of these found some unique local minima and thus extracted some unique information? So it would follow that this would possibly give you a better result than just choosing one of those saved weights because it had the minimum validation loss?

Yes it’s exactly that

Beg to differ, and even sorrier I haven’t actually tried this out, but wanted to chime in…

I wonder if taking the average of the weights would be a good idea for an ensemble. It makes sense to take the final predictions and average them. However, taking the average of the weights… umm… that’s a little counterintuitive, at least to me.

I feel like for any model trained to a point, the weights are optimized in relation to the neurons within the neural network. I strongly suspect that taking the average of these weights wouldn’t translate in a linear way, i.e. the final performance of the network with the averaged weights wouldn’t match the final performance of the averaged predictions (that is, using an ensemble in the traditional way).

But I could be wrong. I didn’t even know this wasn’t used by default. Just my 2 cents.

I think “average of weights” in this case means loading the weights, predicting with each set individually, and then taking the average of those predictions, not taking an average of the actual weights themselves. Yea I agree that would be kinda strange lol
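The prediction-averaging version being described can be sketched in a few lines. This is an illustration with made-up numbers, not fastai code: assume each saved cycle checkpoint has already been loaded and used to produce per-class probabilities for the same test samples, and we just average those outputs.

```python
import numpy as np

# Hypothetical softmax outputs from the model reloaded at each cycle's
# saved checkpoint, one array per snapshot (shape: n_samples x n_classes).
preds_per_cycle = [
    np.array([[0.7, 0.3], [0.2, 0.8]]),  # snapshot from cycle 1
    np.array([[0.6, 0.4], [0.1, 0.9]]),  # snapshot from cycle 2
    np.array([[0.8, 0.2], [0.3, 0.7]]),  # snapshot from cycle 3
]

# Snapshot ensemble: average the predictions, then pick the argmax class.
ensemble_preds = np.mean(preds_per_cycle, axis=0)
labels = ensemble_preds.argmax(axis=1)
```

Averaging the *predictions* like this is well defined regardless of the network's nonlinearities, which is what makes it the safer default compared to averaging the weights directly.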

Btw here is a paper about snapshot ensembling which explains this concept in a lot more detail. Basically the point is that we can implement this technique with fastai by using `cycle_save_name`.

That was the real “aha” moment for me and I’m excited to test it out.

"We show in a series of experiments that our approach is compatible with diverse network architectures and learning tasks. It consistently yields lower error rates than state-of-the-art single models at no additional training cost, and compares favorably with traditional network ensembles"

https://arxiv.org/abs/1704.00109

I think that taking an average of weights is also a valid approach, even though we have nonlinearities. Think about dropout, for example: that is exactly what it relies on. It gives you less dependency between activations (one nice effect), while also effectively training exponentially many models that are averaged at runtime.

Definitely looking forward to further voices in this discussion and will gladly stand corrected if wrong. Interesting conversation!

I’d be very surprised if that worked, but I can’t say I’ve tried it.

I know I’m kinda late to the party, but I also wrote a post about cyclic learning rates.

That’s an awesome blog post with tons of references. Thanks for sharing.