Thanks @jeremy for the merge and your constant motivation!
Here I am writing down the usage:
Usage
Now, fastai users can take care of weight regularization in the following ways:
a) First, don't use weight decay at all:
learn.fit(lrs=0.01, n_cycle=3)
Or, with differential learning rates:
lr = 0.01
learn.unfreeze()
learn.fit(lrs=[lr/100, lr/10, lr], n_cycle=3)
b) Use the old style of weight regularization: a weight decay factor which, according to the paper, is applied in the wrong place in Adam (and other optimizers), and which is never decayed during training.
Single learning rate and weight regularization factor (wds):
learn.fit(lrs=0.01, n_cycle=3, wds=0.025)
Differential learning rate and weight regularization factor (wds):
lr = 0.01
wd = 0.025
learn.unfreeze()
learn.fit(lrs=[lr/100, lr/10, lr], n_cycle=3, wds=[wd/100, wd/10, wd])
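For intuition, here is a minimal sketch (plain PyTorch, not the actual fastai internals) of what that old style amounts to: the penalty wd * w is folded into the gradient before Adam builds its moment estimates, which is the placement the paper argues against.

import torch

def adam_step_old_l2(w, grad, m, v, t, lr=0.01, wd=0.025,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    # "Old" style: the decay term is mixed into the gradient itself,
    # so it also flows into the moment estimates below
    grad = grad + wd * w
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (v_hat.sqrt() + eps)   # parameter update
    return w, m, v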
c) Use the new approach without restarts: corrected placement of the weight regularizer, plus weight decay that is scheduled over time (if the learning rate changes over time).
Single learning rate and weight regularization factor (wds):
learn.fit(lrs=0.01, n_cycle=3, wds=0.025, use_wd_sched=True)
Differential learning rate and weight regularization factor (wds):
lr = 0.01
wd = 0.025
learn.unfreeze()
learn.fit(lrs=[lr/100, lr/10, lr], n_cycle=3, wds=[wd/100, wd/10, wd], use_wd_sched=True)
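And a matching sketch of the corrected placement (again illustrative only, not the fastai implementation): the decay is applied directly to the weights, outside the moment estimates, and is scaled by the same schedule multiplier as the gradient step, so it decays whenever the learning rate does.

def adamw_step_new(w, grad, m, v, t, sched=1.0, lr=0.01, wd=0.025,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    # Expects torch tensors, as in the sketch above.
    # New style: the raw gradient is used for the moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decay applied directly to the weights; sched is the schedule
    # multiplier (e.g. from annealing) scaling both step and decay
    w = w - sched * (lr * m_hat / (v_hat.sqrt() + eps) + wd * w)
    return w, m, v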
d) Use the new approach with restarts (recommended): everything in (c), plus cosine annealing of the learning rate.
Single learning rate and weight regularization factor (wds):
learn.fit(lrs=0.01, n_cycle=3, wds=0.025, use_wd_sched=True, cycle_len=1, cycle_mult=2)
Differential learning rate and weight regularization factor (wds):
lr = 0.01
wd = 0.025
learn.unfreeze()
learn.fit(lrs=[lr/100, lr/10, lr], n_cycle=3, wds=[wd/100, wd/10, wd], use_wd_sched=True, cycle_len=1, cycle_mult=2)
- Tune the parameters according to your use case.
- It is always better to use (d) than (c), because with (c) none of the factors that drive the decay (learning rate, cycle length, etc.) change over time, so the weight regularization factor effectively stays constant (see the sketch below).
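To make that concrete, here is a rough sketch of how such a schedule behaves (an illustrative cosine-with-restarts multiplier, not the exact fastai internals): the same factor that anneals the learning rate also scales the effective weight decay, so both shrink within a cycle and jump back up at each restart.

import math

def cosine_mult(iteration, iters_per_cycle):
    # Decays from 1 to 0 over one cycle, then restarts at 1
    pos = (iteration % iters_per_cycle) / iters_per_cycle
    return 0.5 * (1 + math.cos(math.pi * pos))

lr, wd = 0.01, 0.025
for it in range(8):                      # two cycles of 4 iterations each
    m = cosine_mult(it, iters_per_cycle=4)
    print(f"iter {it}: effective lr = {m * lr:.4f}, effective wd = {m * wd:.4f}")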
Switch the optimizer
You can choose an optimizer other than fastai's default (SGD), like so (here I am using Adam):
import torch.optim as optim
learn = ConvLearner.pretrained(arch, data, precompute=False, opt_fn=optim.Adam)
lr = 0.01
wd = 0.025
learn.fit(lrs=[lr/100, lr/10, lr], n_cycle=3, wds=[wd/100, wd/10, wd], use_wd_sched=True, cycle_len=1, cycle_mult=2)
Thoughts
What I found is that it is wise to bring in weight decay once your model starts overfitting (as opposed to using it right from the start of training).
Overfitting shows up as an excellent training loss alongside a much worse validation loss.
A general workflow might be to train without weight decay for a few epochs, observe the trend, and then apply weight decay to generalize better, i.e. to perform better on unseen data (here, the validation set).
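For concreteness, that workflow might look something like this, reusing the calls above (the parameter values are just placeholders, not a recommendation):

# 1. Train for a bit without weight decay; watch training vs validation loss
learn.fit(lrs=0.01, n_cycle=3)
# 2. If training loss keeps improving while validation loss stalls or worsens,
#    continue training with scheduled weight decay switched on
learn.fit(lrs=0.01, n_cycle=3, wds=0.025, use_wd_sched=True, cycle_len=1, cycle_mult=2)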
@jeremy does that sound right in your experience?