LR finder for fine-tuning

Hello, I was wondering if an LR finder for fine-tuning is in the works? To the best of my understanding, the way to use a pretrained network is:

  1. Freeze the pretrained layers and train the added layers (the LR finder works fine here).
  2. Unfreeze the pretrained layers and train the whole network with differential learning rates (see the sketch below). Since the LR finder only takes a single LR, we can’t find the optimal learning rates for both the pretrained layers (which usually need a smaller LR) and the layers we have added.
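
For concreteness, the workflow I mean looks roughly like this (fastai v1 style; data, the architecture and the LR values are just placeholders):

learn = ConvLearner(data, arch=tvm.resnet34)      # data: a prepared DataBunch (placeholder)
learn.freeze()                                    # 1. freeze pretrained layers...
learn.lr_find(); learn.recorder.plot()            #    ...the LR finder works fine here
learn.fit_one_cycle(1)                            #    ...and train the added layers
learn.unfreeze()                                  # 2. unfreeze and train the whole network
learn.fit_one_cycle(1, max_lr=slice(1e-6, 1e-4))  #    with differential LRs, picked by hand for now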

Would be happy to make a PR, as the project I’m currently working on needs this :slight_smile:

In old fastai as in the new one, you can pass an array of lrs in the start_lr/end_lr parameters of lr_find and get an LR finder for discriminative learning rates.
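
Something like this, for instance (the values are purely illustrative, one entry per layer group):

learn.lr_find(start_lr=[1e-7, 1e-6, 1e-5], end_lr=[0.1, 1, 10], num_it=100)
learn.recorder.plot()   # still a single curve, plotted against the last (maximum) LR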

3 Likes

Hi @sgugger,
Thanks for your contribution to this amazing package.

I still don’t understand how to get multiple LRs for fine-tuning.
Is there any notebook that I could check to see how to pass the arguments?

Is the following code snippet supposed to generate three plots/curves?
I’m getting a single plot with a single curve.

learn = ConvLearner(data, arch=tvm.resnet34)
learn.unfreeze()
learn.lr_find([1e-7, 1e-7, 1e-7], [1, 1, 1], 100)
learn.recorder.plot()

Thanks

You will only get one curve, which plots the losses against the maximum lr. What happens behind the scenes is that we go exponentially from start_lr to end_lr, whether it’s one value or an array of values.
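
If it helps, the idea is roughly this (a sketch of the schedule, not the actual fastai code):

import numpy as np

start_lr = np.array([1e-7, 1e-7, 1e-7])   # one value per layer group
end_lr   = np.array([1.0, 1.0, 1.0])
num_it = 100
for i in range(num_it):
    pct = i / (num_it - 1)
    lrs = start_lr * (end_lr / start_lr) ** pct   # elementwise exponential interpolation
    # ...train one batch with these per-group LRs and record the smoothed loss against lrs[-1]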

1 Like

Thanks for the prompt reply.
But I was wondering how to use this to find differential learning rates.
If I get one curve, how can I find the best learning rate for each group?
I am looking for a method that returns a curve for each layer group.

Thanks

This method doesn’t exist. We haven’t tried modifying the learning rates separately yet.

A little confused… In the first lecture, Jeremy says that setting a learning rate of, for instance, slice(10e-6, 10e-4) will train the initial layers at a slower rate and the later layers at progressively higher rates.
So, although the lr_finder plot is lr vs. loss with a uniform rate for all layers, we change the lr for each layer group differently (if we want to) purely based on intuition…
Is my interpretation accurate? Is this considered modifying the lr individually? Please help!

The slice creates an array of learning rates for you, one per layer group of your model. You can call lr_find with such an array of learning rates like this:

learn.lr_find(start_lr=slice(10e-7, 10e-5), end_lr=slice(0.1, 10))

which will make the lr_finder try differential learning rates from small values to large values and help you pick the value you put at the end of your slice: slice(lr_chosen/100, lr_chosen). Note that the ratio between the first and the last layer group always stays the same (which is why I said we don’t modify them separately, more like all together).
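
To make the slice part concrete: it expands to one learning rate per layer group, spread geometrically between the two endpoints, with the first group at the low end and the head at the high end. Roughly (a sketch, not the library source):

import numpy as np

def slice_to_lrs(start, stop, n_groups):
    "Spread n_groups learning rates geometrically from start (first layers) to stop (head)."
    return start * (stop / start) ** (np.arange(n_groups) / (n_groups - 1))

slice_to_lrs(10e-7, 10e-5, 3)   # -> array([1.e-06, 1.e-05, 1.e-04])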

3 Likes

Should it be end_lr=slice(0.1, 10) or without slice()?

1 Like

Yup, I made a mistake. Will edit :wink:

1 Like

@sgugger. I’m afraid I didn’t understand. Let me ask you some additional questions.

What I’m not understanding here is the “whether it’s one value or an array of values” part. Specifically, how does the LR finder go from start to end when you give it arrays of values for both?



Here, I’m not understanding what follows:

  • which will make the lr_finder try differential learning rates from small values to large value

For example, will it start from 10^-7 or from 10^-5?

  • and help you pick the value you put at the end of your slice

I was unable to construe the above at all.

Thanks in advance.

Why not implement an approach like the one suggested by the OP, e.g. plotting a graph for each block of layers? That way one could adjust the learning rates separately, avoiding spoiling the pretrained weights too much.
But I’m sure the answer to this question lies in the parts of your previous answers that I didn’t understand.

The learning rate finder will start with learning rates going from 1e-7 to 1e-5, 1e-7 being for the first layers and 1e-5 for the head (which is what is called differential learning rates). It’ll then move exponentially until it reaches 0.1 to 10, keeping at every step the same ratio between the maximum and the minimum learning rate.

Then, once you see the graph, you can decide what your maximum learning rate should be (since it’s the one plotted) and pass slice(lr_chosen/100, lr_chosen) as differential learning rates.
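
So a typical session would look something like this (lr_chosen is just whatever value you read off the plot):

learn.unfreeze()
learn.lr_find(start_lr=slice(10e-7, 10e-5), end_lr=slice(0.1, 10))
learn.recorder.plot()          # pick the maximum LR from this curve
lr_chosen = 1e-3               # example value, read off the plot
learn.fit_one_cycle(4, max_lr=slice(lr_chosen/100, lr_chosen))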

Thanks!

Crystal clear now. But how can we leverage the fact that the finder works on the first layers too, when we can only plot the job it did on the head?

I think I’m getting why you recommend keeping the same orders of magnitude. But see my question below.

But then we have no guarantee that what the finder found while working on the head mirrors what it found on the tail, just at a finer grain (1/100 in your example).
E.g. lr_chosen/100 could be the wrong LR to use on the tail, and we would have no information about that, since we get no plot for the tail.

What am I not catching?

Thanks!

This may help you: I’ve subclassed the Recorder so that it will plot the LRs for all the layer groups.

from collections import defaultdict
from typing import Any, Collection

import numpy as np
import matplotlib.pyplot as plt
from torch import Tensor

# fastai v1 imports (adjust if your fastai version keeps these elsewhere)
from fastai.basic_train import Learner, Recorder
from fastai.core import range_of, PBar


class MyRecorder(Recorder):
    "A `LearnerCallback` that records epoch, loss, opt and metric data during training, keeping one LR history per layer group."
    _order = -10
    def __init__(self, learn:Learner):
        super().__init__(learn)

    def on_backward_begin(self, smooth_loss:Tensor, **kwargs:Any)->None:
        "Record the loss before any other callback has a chance to modify it."
        self.losses.append(smooth_loss.item())
        if self.pbar is not None and hasattr(self.pbar,'child'):
            self.pbar.child.comment = f'{smooth_loss:.4f}'

    def on_train_begin(self, pbar:PBar, metrics_names:Collection[str], **kwargs:Any)->None:
        "Initialize recording status at beginning of training."
        self.pbar = pbar
        self.names = ['epoch', 'train_loss', 'valid_loss'] + metrics_names
        self.pbar.write('  '.join(self.names), table=True)
        self.losses,self.val_losses,self.moms,self.metrics,self.nb_batches = [],[],[],[],[]
        self.lrs = defaultdict(list)   # layer group name -> list of LRs, one entry per batch

    def on_batch_begin(self, train, **kwargs:Any)->None:
        "Record the learning rate of every layer group (and the momentum) at beginning of batch."
        if train:
            for i, lr in enumerate(self.opt.read_val('lr')):
                self.lrs[f"layer_group_{i}"].append(lr)
            self.moms.append(self.opt.mom)

    def plot_lr(self, show_moms=False)->None:
        "Plot learning rate per layer group, `show_moms` to include momentum."
        n_layer_groups = len(self.lrs)
        if show_moms:
            _, axs = plt.subplots(n_layer_groups, 2, figsize=(12, 4), constrained_layout=True)
            axs = np.array(axs).flatten()
            for i, (layer_group, lrs) in enumerate(self.lrs.items()):
                axs[i * 2].set_title(f"LR for {layer_group}")
                axs[i * 2].plot(range_of(lrs), lrs)
                axs[(i * 2) + 1].set_title(f"Momentum for {layer_group}")
                axs[(i * 2) + 1].plot(range_of(self.moms), self.moms)
        else:
            _, axs = plt.subplots(n_layer_groups, 1, figsize=(12, 4), constrained_layout=True)
            axs = np.array(axs).flatten()
            for i, (layer_group, lrs) in enumerate(self.lrs.items()):
                axs[i].set_title(f"LR for {layer_group}")
                axs[i].plot(range_of(lrs), lrs)

    def plot(self, skip_start:int=10, skip_end:int=5)->None:
        "Plot losses against learning rate for each layer group, trimmed between `skip_start` and `skip_end`."
        n_layer_groups = len(self.lrs)
        _, axs = plt.subplots(n_layer_groups, 1, figsize=(8, 8), constrained_layout=True)
        axs = np.array(axs).flatten()
        for i, (layer_group, lrs) in enumerate(self.lrs.items()):
            lrs = lrs[skip_start:-skip_end] if skip_end > 0 else lrs[skip_start:]
            losses = self.losses[skip_start:-skip_end] if skip_end > 0 else self.losses[skip_start:]
            axs[i].set_title(f"{layer_group}")
            axs[i].plot(lrs, losses)
            axs[i].set_ylabel("Loss")
            axs[i].set_xlabel("Learning Rate")
            axs[i].set_xscale('log')
            axs[i].xaxis.set_major_formatter(plt.FormatStrFormatter('%.0e'))
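
For what it’s worth, here’s roughly how I’d plug it in (this assumes the fastai v1 convention that the classes listed in learn.callback_fns are instantiated at fit time and exposed on the learner under their snake_case name, so this one shows up as learn.my_recorder; adjust if your version behaves differently):

learn = ConvLearner(data, arch=tvm.resnet34)
learn.callback_fns.append(MyRecorder)    # keep the default Recorder, add this one on top
learn.unfreeze()
learn.lr_find(start_lr=slice(1e-7, 1e-5), end_lr=slice(0.1, 10))
learn.my_recorder.plot()                 # one loss-vs-LR panel per layer group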

2 Likes

Cool! Try and submit a PR!

1 Like

@sgugger I read your “The 1cycle policy” blog post and I am wondering: when tuning all the other hyper-parameters with lr_find, should you rebuild your model with the same starting weights each time? Another question: does lr_find keep the updated weights after finishing, i.e. does it count as a partial/full epoch of training?

The lr_finder is a mock training: at the end of it the model has diverged, so you have to start over with fresh weights (in fastai the .lr_find() method will load back the weights from the beginning).
Likewise, when you tune all the other hyper-parameters, you should probably reload the same starting weights.

2 Likes

Hello there, I have this plot; how do I find the learning rate?

[attached: lr_find plot]

@sgugger

learn.lr_find(start_lr=slice(10e-7, 10e-5), end_lr=slice(0.1, 10))

Sorry for bumping. When we choose start_lr, should its first value be around where the loss decreases most sharply in the lr_find plot, and the end of the start_lr slice be somewhere slightly after the minimum?

And is end_lr more or less fixed, since it follows the 1cycle policy?

Did I understand this correctly? Thanks.