I really like this! And I love the idea of cross-referencing other materials that can help explain things.
One thing that this post (and everyone's posts) should mention is that you're taking the new course in person, and that it's not yet available to the general public, but will be at the end of the year at course.fast.ai (which currently hosts last year's version).
The learning rate finder happens to be mentioned in the CLR paper, but otherwise has nothing to do with CLR.
I mentioned the upper and lower rate finder as described in the paper; was it not its original contribution?
SGDR is not an optimizer, it’s a different annealing schedule
Aah… correcting now.
CLR is an annealing schedule that’s not used at all in the fastai library.
Yes, I mentioned that 'The fastai library uses CLR to find an optimal LR and SGDR as the optimizer.' Now that I see your post, you mentioned fastai uses the idea from CLR. I will correct it. Would that be an appropriate statement?
So I’d suggest covering SGDR, not CLR, since SGDR solves the same problem but better
Btw I also discovered that fastai student Brad Kenstler implemented the CLR scheduler in Keras, which got ported to PyTorch and is awaiting review to be merged to master. Amazing!
I found that his implementation of cycling is different from fastai's PyTorch version. He updates the rate in a linear fashion (increasing and then decreasing), whereas fastai decreases it with cosine annealing and then immediately restarts back at the upper learning rate. See the difference in the LR plots:
Fast AI PyTorch
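For anyone who wants to compare the two schedules without the plots, here's a minimal sketch of both shapes (my own toy functions assuming one LR update per iteration, not the actual fastai or Keras code):

```python
import numpy as np

def triangular_clr(step, step_size, lr_min, lr_max):
    # Triangular CLR: LR rises linearly from lr_min to lr_max over
    # step_size steps, then falls linearly back; repeats every 2*step_size.
    frac = (step % (2 * step_size)) / step_size
    if frac <= 1.0:
        return lr_min + (lr_max - lr_min) * frac
    return lr_max - (lr_max - lr_min) * (frac - 1.0)

def sgdr_cosine(step, cycle_len, lr_min, lr_max):
    # SGDR-style schedule: cosine annealing from lr_max down toward lr_min
    # within a cycle, then an abrupt "warm restart" back to lr_max.
    t = (step % cycle_len) / cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * t))
```

Printing both over a few hundred steps reproduces the plots above: the triangular one is symmetric, while the cosine one decays smoothly and then jumps straight back up at each restart.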
Just one comment: in my understanding, you must use cycle_save_name in learn.fit() if you want to save the weights after each cycle_len and, at the end, get the average of the weights.
Is that the way it works though? Does cycle_save_name give you the average of the weights, or does it save the set with the minimum validation loss (i.e. like the Keras model-checkpoint callback)? cc @jeremy
So is this similar to the effect of snapshot ensembling then? If you were to use cycle_save_name and then take the average of the predictions generated from the weights saved at each cycle, the idea would be that each cycle found some unique local minimum and thus extracted some unique information. It would follow that this could give you a better result than just choosing the one set of saved weights with the minimum validation loss?
Beg to differ, and even sorrier I haven’t actually tried this out, but wanted to chime in…
I wonder if taking the average of the weights would be a good idea for ensemble predictions. It makes sense to take the final predictions and average them. However, taking the average of the weights… umm… that's a little counterintuitive, at least to me.
I feel like for any model trained to a point, the weights are optimized in relation to the neurons within the neural network. I strongly suspect that taking the average of these weights wouldn't translate in a linear way, i.e. the performance of the network with the averaged weights wouldn't match the performance of the averaged predictions (that is, using an ensemble in the traditional way).
But I could be wrong. I didn’t even know this wasn’t used by default. Just my 2 cents
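A toy example makes that non-linearity point concrete (a minimal sketch of my own, not from any library): with even a single ReLU, the prediction of the averaged weights can differ from the average of the predictions.

```python
import numpy as np

def relu_net(w, x):
    # Tiny one-hidden-unit "network": relu(w[0] * x) scaled by w[1].
    return np.maximum(w[0] * x, 0.0) * w[1]

w_a = np.array([1.0, 1.0])    # weights from one hypothetical snapshot
w_b = np.array([-1.0, 1.0])   # weights from another
x = 2.0

# Ensemble in the traditional way: average the two models' predictions.
avg_preds = 0.5 * (relu_net(w_a, x) + relu_net(w_b, x))   # = 1.0

# Average the weights first, then predict: the ReLU kills the signal.
pred_of_avg = relu_net(0.5 * (w_a + w_b), x)              # = 0.0
```

The two quantities disagree (1.0 vs 0.0), which is exactly the "doesn't translate in a linear way" worry, though averaging weights can still work well in practice when the snapshots sit in the same basin.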
I think "average of weights" in this case means loading the weights, predicting with each set individually, and then taking the average of those predictions, not taking an average of the actual weights themselves. Yeah, I agree that would be kinda strange lol
Btw here is a paper about snapshot ensembling which explains this concept in a lot more detail. Basically the point is that we can implement this technique with fastai by using cycle_save_name. That was the real "aha" moment for me and I'm excited to test it out.
"We show in a series of experiments that our approach is compatible with diverse network architectures and learning tasks. It consistently yields lower error rates than state-of-the-art single models at no additional training cost, and compares favorably with traditional network ensembles" https://arxiv.org/abs/1704.00109
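At inference time the snapshot-ensembling part is simple: average the per-snapshot predicted probabilities, not the weights. A hedged sketch (the arrays below stand in for the predictions you'd get from each cycle's saved weights, e.g. those written out via cycle_save_name; the names are illustrative, not the fastai API):

```python
import numpy as np

# One (n_samples, n_classes) probability array per saved cycle snapshot.
snapshot_probs = [
    np.array([[0.7, 0.3], [0.2, 0.8]]),  # predictions from snapshot 1
    np.array([[0.5, 0.5], [0.4, 0.6]]),  # predictions from snapshot 2
]

# Snapshot ensemble = element-wise mean over snapshots.
ensemble = np.mean(snapshot_probs, axis=0)
predicted_classes = ensemble.argmax(axis=1)
```

Each snapshot costs nothing extra to train (they fall out of the cyclic schedule anyway), which is the "no additional training cost" claim in the quote above.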
I think that taking an average of weights is also a valid approach, even though we have nonlinearities. Think about dropout, for example: that is exactly what it relies on. It gives you less dependency between activations (one nice effect) while also effectively training exponentially many models that are averaged at runtime.
Definitely looking forward to further voices in this discussion, and I will gladly stand corrected if wrong. Interesting conversation!