New AdamW optimizer now available

Thanks to @anandsaha, we now have the new AdamW optimizer available in fastai! I haven’t yet tested it myself, but Anand has provided a handy notebook, adamw-sgdw-demo.ipynb, showing its use and impact on various datasets. I’d be really interested to hear if anyone tries this, and if so, whether it helps with overfitting and/or training times. For those interested in the details, the paper is: https://arxiv.org/abs/1711.05101
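
For a rough intuition of what the paper proposes, here is a minimal sketch of a single parameter update (plain Python, ignoring bias correction; this is only an illustration of the idea, not fastai’s actual implementation), contrasting L2 regularization folded into the gradient with AdamW’s decoupled weight decay:

# Old style: weight decay is added to the gradient, so it gets rescaled by
# Adam's adaptive denominator along with the gradient itself.
def adam_with_l2(p, grad, m, v, lr, wd, eps=1e-8):
    grad = grad + wd * p
    m = 0.9 * m + 0.1 * grad
    v = 0.999 * v + 0.001 * grad ** 2
    p = p - lr * m / (v ** 0.5 + eps)
    return p, m, v

# AdamW: the decay term never enters the gradient statistics; it is applied
# directly to the weights, scaled only by the learning rate.
def adamw(p, grad, m, v, lr, wd, eps=1e-8):
    m = 0.9 * m + 0.1 * grad
    v = 0.999 * v + 0.001 * grad ** 2
    p = p - lr * m / (v ** 0.5 + eps) - lr * wd * p
    return p, m, v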

Here’s the PR, which is a great example of the process of contributing more complex code to fastai: https://github.com/fastai/fastai/pull/46

42 Likes

Awesome @anandsaha! Inspiring stuff.

1 Like

Thanks @jeremy for the merge and your constant motivation :slight_smile: !

Here I am putting down the usage:

Usage

Now, fastai users have the following approaches available to take care of weight regularization:

a) First off, not using weight decay at all

learn.fit(lrs=0.01, n_cycle=3)

Or

lr = 0.01
learn.unfreeze()
learn.fit(lrs=[lr/100, lr/10, lr], n_cycle=3)

b) Use the old way of weight regularization: a weight decay factor which, according to the paper, is applied in the wrong place in Adam (and other optimizers), and which is not decayed over time

Single learning rate and weight regularization factor (wds):

learn.fit(lrs=0.01, n_cycle=3, wds=0.025)

Differential learning rate and weight regularization factor (wds):

lr = 0.01
wd = 0.025
learn.unfreeze()
learn.fit(lrs=[lr/100, lr/10, lr], n_cycle=3, wds=[wd/100, wd/10, wd])

c) Use the new way of doing things, without restarts: corrected placement of the weight regularizer, and the weight decay itself is decayed over time (if the lr changes over time)

Single learning rate and weight regularization factor (wds):

learn.fit(lrs=0.01, n_cycle=3, wds=0.025, use_wd_sched=True)

Differential learning rate and weight regularization factor (wds):

lr = 0.01
wd = 0.025
learn.unfreeze()
learn.fit(lrs=[lr/100, lr/10, lr], n_cycle=3, wds=[wd/100, wd/10, wd], use_wd_sched=True)

d) Use the new way of doing things, with restarts (recommended): everything in (c), plus cosine annealing of the lr

Single learning rate and weight regularization factor (wds):

learn.fit(lrs=0.01, n_cycle=3, wds=0.025, use_wd_sched=True, cycle_len=1, cycle_mult=2)

Differential learning rate and weight regularization factor (wds):

lr = 0.01
wd = 0.025
learn.unfreeze()
learn.fit(lrs=[lr/100, lr/10, lr], n_cycle=3, wds=[wd/100, wd/10, wd], use_wd_sched=True, cycle_len=1, cycle_mult=2)
  • Tune the parameters according to your use case.
  • It is always better to use (d) than (c), because with (c) none of the factors that would decay the weight (like the lr, cycle length, etc.) change over time, so the weight regularization factor stays constant; see the sketch below.
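
To make that concrete, here is a rough sketch of how a cosine-annealed schedule multiplier could scale both the learning rate and the weight decay within a cycle. This is only an illustration of the idea from the paper, not the code in the PR, and the iteration count is made up:

import math

def cosine_multiplier(t, cycle_iters):
    # Annealing multiplier going from 1 down towards 0 over one cycle.
    return 0.5 * (1 + math.cos(math.pi * t / cycle_iters))

base_lr, base_wd = 0.01, 0.025
cycle_iters = 100  # iterations in the current cycle (illustrative)

for t in range(cycle_iters):
    mult = cosine_multiplier(t, cycle_iters)
    lr_t = base_lr * mult   # the learning rate is annealed...
    wd_t = base_wd * mult   # ...and the weight decay follows the same schedule
    # with option (c) there is no schedule, so wd_t would stay at base_wd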

Switch the optimizer

You can choose to use a different optimizer than the default in fastai, which is SGD, like so (I am using Adam here):

import torch.optim as optim
learn = ConvLearner.pretrained(arch, data, precompute=False, opt_fn=optim.Adam)
lr = 0.01
wd = 0.025
learn.fit(lrs=[lr/100, lr/10, lr], n_cycle=3, wds=[wd/100, wd/10, wd], use_wd_sched=True, cycle_len=1, cycle_mult=2)

Thoughts

What I found is that it is wise to bring in weight decay if your model is overfitting (as opposed to using it right from the start of training).

Overfitting would be when your training loss is excellent but your validation loss is terrible.

A general workflow might be to train without weight decay for a few epochs, observe the trend, and then apply weight decay to generalize better, i.e. to perform better on unseen data (in this case, the validation set).
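
As a rough sketch of that workflow (reusing the learn object and the lr/wd values from the examples above; the number of cycles is just a placeholder):

# Phase 1: train without weight decay and watch training vs. validation loss.
learn.fit(lrs=0.01, n_cycle=3)

# Phase 2: if training loss keeps improving while validation loss stalls or
# gets worse (overfitting), bring in scheduled weight decay with restarts.
learn.fit(lrs=0.01, n_cycle=3, wds=0.025, use_wd_sched=True,
          cycle_len=1, cycle_mult=2)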

@jeremy does that sound right in your experience?

54 Likes

Outstanding work @anandsaha getting this implemented! Excited to try it out now :slight_smile:

2 Likes

Much impressed @anandsaha! :slight_smile: And thank you for this great writeup on how to use the new functionality - bookmarking for reference :slight_smile:

1 Like

Awesome job @anandsaha. I’m sure you had to work really hard to get this done.

1 Like

Thanks @ravivijay, @jamesrequa, @radek, @apil.tamang :slight_smile:

If you happen to use this feature, let us know how it goes :slight_smile: Specifically, you may contrast (b) and (d).

Awesome @anandsaha. You inspire us all! :+1:

1 Like

Yes, but perhaps I’d try data augmentation and dropout first. Although note that a little weight decay can sometimes make it easier to train, by making the loss function surface smoother.

7 Likes

@anandsaha Anand, I am trying to run the demo notebook. Did you create the validation set dir yourself, and what percentage of the training set did you use to create it for cifar10?

Get the Cifar10 dataset from files.fast.ai/data. It has the data segregated into train and val.

Got it. Thanks.

1 Like

Hey gang, please try using this new feature in your competition notebooks and the lesson notebooks, with optim.Adam and various levels of weight decay (and maybe decreasing dropout sometimes too), and let us know:

  • Can you get it to train more quickly than without this feature?
  • Can you get better results than you had before?

Note that you’ll need to run lr_find again since your optimal learning rate will be different with this approach.
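
For example, something along these lines (a sketch reusing the arch and data objects from Anand’s example above; the wd value is just a starting point to experiment with):

import torch.optim as optim

learn = ConvLearner.pretrained(arch, data, precompute=False, opt_fn=optim.Adam)

# Re-run the learning rate finder, since the optimal lr changes with this setup.
learn.lr_find()
learn.sched.plot()

# Then train with the lr picked from the plot and scheduled weight decay.
learn.fit(lrs=1e-2, n_cycle=3, wds=0.025, use_wd_sched=True,
          cycle_len=1, cycle_mult=2)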

Let us know if you have any questions about how to get this working. Hopefully you’ll find it helpful…

6 Likes

Woah ! Super neat stuff @anandsaha ! Trying it out now. :beers:

Thought @anandsaha and @jeremy might like to know - just got this link at the top of my Google feed:


It discusses the AdamW addition to fastai (and links to learning rate posts from students as well).
Is Sebastian Ruder in the program? Seems like a great post.

7 Likes

No he’s not - he’s the DL researcher I mentioned a couple of weeks ago; he’s one of the best in the world IMO. He saw what’s been happening in our program and I offered to share the lesson 4 video with him. That helped inspire this post - we’ve also started discussing doing some joint research early next year (which hopefully will make its way into part 2 of the course).

9 Likes

That was a great article; it looks like we’re working on some of the cutting-edge highlights here in the course! It’s also a nice template for technical writing: the article reads like a blog post and is as informative as any ML paper.

3 Likes

I almost feel like we should have a wd_find() to go alongside lr_find() when we are using weight decay.
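
Just to sketch the idea (wd_find doesn’t exist in the library; this is purely a hypothetical brute-force version):

def wd_find(make_learner, lr, wd_candidates, n_cycle=1, cycle_len=1):
    # Hypothetical helper: train briefly with each candidate weight decay and
    # compare the validation losses printed by fit() by hand.
    for wd in wd_candidates:
        learn = make_learner()  # fresh learner so runs don't bleed into each other
        print(f'--- wd = {wd} ---')
        learn.fit(lrs=lr, n_cycle=n_cycle, wds=wd, use_wd_sched=True,
                  cycle_len=cycle_len)

# e.g. wd_find(lambda: ConvLearner.pretrained(arch, data, opt_fn=optim.Adam),
#              lr=0.01, wd_candidates=[1e-4, 1e-3, 1e-2, 2.5e-2, 1e-1])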

6 Likes

Any success so far? I had one unsuccessful try.

I tried @anandsaha’s AdamW on the movielens notebook and it was a little worse. Anyone tried dogs & cats, or planet, or rossmann?