Thread for Blogs (Just created one for ResNet)

I have created a ‘DL’ list under my handle for finding all the news/activity in one place.
My handle is bhutanisanyam1

Another entry from me :slight_smile:

I had a tough time with some of the issues I encountered and some of the info I share could have saved me quite a bit of frustration and time. Also, I kept reading about the dynamic computation graph but the idea is way cooler and simpler in practice than the name would imply :slight_smile:
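
(To make “dynamic computation graph” concrete, here is a tiny sketch of my own, not something from the tutorials: the graph is simply whatever Python code runs in forward, so ordinary control flow just works.)

```python
import torch
import torch.nn as nn

class TinyDynamicNet(nn.Module):
    """Toy module: the graph is rebuilt on every forward pass,
    so ordinary Python control flow decides its shape."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # Apply the same layer a data-dependent number of times.
        for _ in range(int(x.abs().sum().item()) % 3 + 1):
            x = torch.relu(self.linear(x))
        return x.sum()

net = TinyDynamicNet()
loss = net(torch.randn(1, 4))
loss.backward()  # autograd traces exactly the ops that actually ran
```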

BTW I fully realize we are not getting into the nitty-gritty details of PyTorch just yet, but who knows what the next lecture will bring :slight_smile: @jeremy mentioned that we will learn PyTorch, and I wanted to understand the code a little bit better, so I started with the official tutorials, which led to frustration, which led to this…

The official tutorials are really good BTW and I would fully recommend them - which I actually do in the post - but some of the info could probably have been made a bit more explicit; also, had I had more recent experience with numerical computations, I probably would not have run into some of the stability issues. Anyhow, others are probably in my shoes too, so maybe the post can be of help :slight_smile:

9 Likes

I’ve just published a post about estimating a good learning rate.

I added links to a few other Medium posts from this thread at the end of my post. Cross-linking will help get more traffic to our blogs in general.

7 Likes

I really like this! And I love the idea of cross-referencing other materials that can help explain things.

One thing that this post - and everyone’s posts - should mention is that you’re taking the new course in person, and that it’s not available to the general public yet, but will be at the end of the year at course.fast.ai (which currently has last year’s version).

2 Likes

Thank you, Jeremy. I added this note to my post.

Yes, I’ll correct that

And here is my attempt to make practitioners aware of Cyclic Learning Rate :slight_smile:

Blogged about it here:

The Cycling Learning Rate technique

4 Likes

Nice job @anandsaha. Although I think you’re confusing some issues:

  • The learning rate finder happens to be mentioned in the CLR paper, but has nothing to do with CLR otherwise
  • SGDR is not an optimizer, it’s a different annealing schedule
  • CLR is an annealing schedule that’s not used at all in the fastai library.

So I’d suggest covering SGDR, not CLR, since SGDR solves the same problem but better :slight_smile:

4 Likes

Thanks @jeremy for that feedback!

The learning rate finder happens to be mentioned in the CLR paper, but has nothing to do with CLR otherwise

I mentioned the upper and lower rate finder as described in the paper - was that not its original contribution?

SGDR is not an optimizer, it’s a different annealing schedule

Aah… correcting now.

CLR is an annealing schedule that’s not used at all in the fastai library.

Yes, I mentioned that ‘The fastai library uses CLR to find an optimal LR and SGDR as the optimizer.’ Now that I see your post, you mentioned that fastai uses the idea from CLR. I will correct it. Would that be an appropriate statement?

So I’d suggest covering SGDR, not CLR, since SGDR solves the same problem but better

Absolutely! I am going through that paper :slight_smile:

Btw, I also discovered that fastai student Brad Kenstler implemented the CLR scheduler in Keras, which got ported to PyTorch and is awaiting review to be merged to master. Amazing!

1 Like

No, the main contribution was the idea of continuously moving the LR both down and up. Previously people had generally only decreased LR.

The idea of the “LR finder” was an additional contribution, but is largely orthogonal.

So I used an idea from the CLR paper (the idea of the LR finder), not the main idea. :slight_smile:

The SGDR paper shows very impressive SoTA results, especially along with snapshot ensembling.

Hope that clarifies a bit…
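
For anyone reading along, here is a rough sketch of the LR finder idea in plain PyTorch - my own simplified version, not the fastai implementation: grow the learning rate a little every mini-batch, record the loss, and stop once the loss blows up; you then pick a rate from the region where the loss was still falling steeply.

```python
import copy
import torch

def lr_range_test(model, loss_fn, opt, loader, start_lr=1e-7, end_lr=10, num_steps=100):
    """Simplified LR-range test: grow the LR geometrically each batch and log the loss."""
    state = copy.deepcopy(model.state_dict())       # so we can undo the test afterwards
    gamma = (end_lr / start_lr) ** (1 / num_steps)  # multiplicative LR step per batch
    lr, lrs, losses, best = start_lr, [], [], float('inf')

    for step, (x, y) in zip(range(num_steps), loader):
        for g in opt.param_groups:
            g['lr'] = lr
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

        lrs.append(lr)
        losses.append(loss.item())
        best = min(best, loss.item())
        if loss.item() > 4 * best:                  # loss has exploded -> stop the test
            break
        lr *= gamma

    model.load_state_dict(state)                    # restore the original weights
    return lrs, losses                              # plot these and pick an LR
```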

2 Likes

Got it, thanks for the insight! :slight_smile:

I found that his implementation of cycling is different from fastai’s PyTorch one. He updates the rate in a linear fashion (increasing, then decreasing), whereas fastai’s PyTorch version decreases it with cosine annealing and then immediately restarts back at the upper learning rate. See the difference in the LR plots:
fastai PyTorch

bckenstler’s method
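
To see the difference in code rather than plots, here is a rough sketch of the two schedules (plain functions written for illustration, not either library’s actual code):

```python
import math

def triangular_clr(it, step_size, lr_min, lr_max):
    """bckenstler-style triangular policy: LR rises linearly from lr_min to
    lr_max over step_size iterations, then falls linearly back down."""
    cycle_pos = it % (2 * step_size)
    frac = cycle_pos / step_size
    frac = frac if frac <= 1 else 2 - frac          # up, then back down
    return lr_min + (lr_max - lr_min) * frac

def cosine_with_restarts(it, cycle_len, lr_min, lr_max):
    """SGDR-style policy: LR decays from lr_max to lr_min along a cosine,
    then jumps straight back to lr_max at the start of the next cycle."""
    t = (it % cycle_len) / cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# Print both schedules over a couple of cycles to compare their shapes.
for it in range(0, 200, 20):
    print(it, round(triangular_clr(it, 100, 1e-4, 1e-2), 5),
              round(cosine_with_restarts(it, 100, 1e-4, 1e-2), 5))
```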

1 Like

Yes, the one he implemented (and I wrote about) is from this paper.

The one in fastai is from these papers:
https://arxiv.org/abs/1608.03983
https://arxiv.org/abs/1704.00109

(also @jeremy’s reply has some more info)

3 Likes

Great article @apil.tamang!

Just one comment: in my understanding, you must use cycle_save_name in learn.fit() if you want to save the weights after each cycle_len and, at the end, get the average of the weights.

1 Like

Is that the way it works though? Does cycle_save_name give you the average of the weights, or does it save the one that has the minimum validation loss (i.e. like the Keras model checkpoint callback)? cc @jeremy

It saves after every cycle. It’s up to you to load them up and average them. See planet_cv.ipynb for an example.
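
To make that concrete, here is a rough sketch of the “load them up and average them” step in plain PyTorch; the checkpoint file names and the get_model / val_loader helpers are made up for illustration, so see the notebook for the real fastai version:

```python
import torch

# Hypothetical checkpoint files saved (via torch.save of a state_dict) at the end of each cycle.
checkpoints = ['cycle_0.pth', 'cycle_1.pth', 'cycle_2.pth']

def predict(model, loader):
    """Collect softmax predictions for the whole validation set."""
    model.eval()
    with torch.no_grad():
        return torch.cat([torch.softmax(model(x), dim=1) for x, _ in loader])

model = get_model()                                  # assumed helper that builds the architecture
all_preds = []
for path in checkpoints:
    model.load_state_dict(torch.load(path))
    all_preds.append(predict(model, val_loader))     # val_loader assumed to exist

ensemble_preds = torch.stack(all_preds).mean(dim=0)  # average the predictions, not the weights
```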

5 Likes

So is this similar to the effect of snapshot ensembling then? That is, if you were to use cycle_save_name and then take the average of the preds generated from the weights saved at each cycle, the idea would be that each cycle perhaps found some unique local minimum and thus extracted some unique information. It would follow that this could possibly give you a better result than just choosing the one set of saved weights with the minimum validation loss?

2 Likes

Yes it’s exactly that :slight_smile:

1 Like

Beg to differ - and I’m even sorrier that I haven’t actually tried this out - but I wanted to chime in…

I wonder whether taking the average of the weights would be a good idea for ensembling predictions. It makes sense to take the final predictions and average them. However, taking the average of the weights… umm… that’s a little counter-intuitive, at least to me.

I feel like for any model trained to a point, the weights are optimized in relation to the other neurons within the network. I strongly suspect that taking the average of these weights wouldn’t translate in a linear way, i.e. the performance of a network with averaged weights wouldn’t match the performance of averaged predictions (that is, an ensemble in the traditional sense).

But I could be wrong. I didn’t even know this wasn’t used by default. Just my 2 cents :slight_smile:
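
One way to sanity-check this is to compare the two directly on a toy model; a rough sketch (entirely illustrative, nothing from the library) that averages the weights in one case and the predictions in the other:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 10)                               # toy batch

# Two "snapshots" of the same architecture with different weights.
models = [nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2)) for _ in range(2)]

with torch.no_grad():
    # (a) Average the predictions of the separate models.
    avg_preds = torch.stack([m(x) for m in models]).mean(dim=0)

    # (b) Average the weights into a single model, then predict once.
    avg_model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
    sd0, sd1 = models[0].state_dict(), models[1].state_dict()
    avg_model.load_state_dict({k: (sd0[k] + sd1[k]) / 2 for k in sd0})
    preds_of_avg = avg_model(x)

# Because of the ReLU non-linearity the two generally disagree.
print((avg_preds - preds_of_avg).abs().max())
```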

2 Likes