Automated Learning Rate Suggester

aychang · April 20, 2019, 4:31pm

Currently in Fast.ai’s learning rate (LR) finder for its 1cycle learning policy, choosing the learning rate for the next fitting is a bit of an art. Recommended methods include choosing the LR at the steepest decline of loss or 10x prior to the minimum loss. Like others, I have found the LR finder very useful but have had trouble automating the selection of a “good” learning rate. Fast.ai’s current suggestion in the LR finder, the point at which the gradient of the losses with respect to the LR is at its lowest, didn’t work well for me in some cases: when training classifiers on unstructured text, it hasn’t worked well as I unfreeze more layers.

However, I am looking to automate the training of dozens of models on different datasets, so this is a problem that needs to be solved. To address it, I came up with a method that automates the selection of an LR in Fast.ai, and so far it’s been working pretty well.

Here is the code:

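import numpy as np
import matplotlib.pyplot as plt
from fastai.basic_train import Learner  # fastai v1

def find_appropriate_lr(model:Learner, lr_diff:int = 15, loss_threshold:float = .05, adjust_value:float = 1, plot:bool = False) -> float: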
    #Run the Learning Rate Finder
    model.lr_find()
    
    #Get loss values and their corresponding gradients, and get lr values
    losses = np.array(model.recorder.losses)
    assert(lr_diff < len(losses))
    loss_grad = np.gradient(losses)
    lrs = model.recorder.lrs
    
    #Search for index in gradients where loss is lowest before the loss spike
    #Initialize right and left idx using the lr_diff as a spacing unit
    #Set the local min lr as -1 to signify that the loop never ran (e.g. loss_threshold is too high)
    local_min_lr = -1
    r_idx = -1
    l_idx = r_idx - lr_diff
    while (l_idx >= -len(losses)) and (abs(loss_grad[r_idx] - loss_grad[l_idx]) > loss_threshold):
        local_min_lr = lrs[l_idx]
        r_idx -= 1
        l_idx -= 1

    lr_to_use = local_min_lr * adjust_value
    
    if plot:
        # plots the gradients of the losses with respect to the learning rate change
        plt.plot(loss_grad)
        plt.plot(len(losses)+l_idx, loss_grad[l_idx],markersize=10,marker='o',color='red')
        plt.ylabel("Loss gradient")
        plt.xlabel("Index of LRs")
        plt.show()

        plt.plot(np.log10(lrs), losses)
        plt.ylabel("Loss")
        plt.xlabel("Log 10 Transform of Learning Rate")
        loss_coord = np.interp(np.log10(lr_to_use), np.log10(lrs), losses)
        plt.plot(np.log10(lr_to_use), loss_coord, markersize=10,marker='o',color='red')
        plt.show()
        
    return lr_to_use

This function takes in your Learner model and parameters that allow the LR selection to be tuned as needed. Taking advantage of the fact that the loss skyrockets once the learning rate gets high enough, I used an “interval slide rule” technique: an interval of fixed width slides from right to left along the loss gradient plot of the learning rate finder until the loss gradient at the right interval bound comes within a close-enough distance of the loss gradient at the left interval bound. The left interval bound is then taken as the selected learning rate, which can be adjusted through a multiplier argument.

The plots below provide some visualizations.

Parameters:

lr_diff provides the interval distance, in units of the “index of LR” (log transform of LRs), between the right and left bounds
loss_threshold is the maximum difference between the left and right bounds’ loss gradient values at which the shift stops
adjust_value is a coefficient applied to the final learning rate for pure manual adjustment
plot is a boolean that shows two plots, the LR finder’s gradient plot and the LR finder plot, as shown below
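
To make the slide rule concrete without a Learner, here is a minimal sketch of the same loop run on a synthetic loss curve. The name suggest_lr and the curve itself are assumptions made up for illustration; only the loop mirrors find_appropriate_lr above.

```python
import numpy as np

def suggest_lr(lrs, losses, lr_diff=15, loss_threshold=0.05, adjust_value=1.0):
    """Slide a fixed-width interval from right to left over the loss gradients."""
    losses = np.asarray(losses, dtype=float)
    assert lr_diff < len(losses)
    loss_grad = np.gradient(losses)

    local_min_lr = -1.0  # sentinel: stays -1 if the loop never runs
    r_idx = -1
    l_idx = r_idx - lr_diff
    # Keep shifting left while the gradients at the two bounds still differ
    # by more than the threshold.
    while l_idx >= -len(losses) and abs(loss_grad[r_idx] - loss_grad[l_idx]) > loss_threshold:
        local_min_lr = lrs[l_idx]
        r_idx -= 1
        l_idx -= 1
    return local_min_lr * adjust_value

# Synthetic LR sweep (hypothetical data): 100 log-spaced learning rates,
# a slowly falling loss, and a blow-up at the high end.
t = np.linspace(0, 1, 100)
lrs = np.logspace(-6, 1, 100)
losses = 2.0 - 0.1 * t + np.exp(-12 + 15 * t)

suggested = suggest_lr(lrs, losses)
print(f"suggested lr: {suggested:.4g}")
```

A completely flat curve never enters the loop, so the sketch returns the -1 sentinel in that case; callers would want to check for it before training.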

The plot below is the loss gradient plot showing the interval slide rule iterating along the LR curve from right to left until the difference in loss gradient at the left and right bounds falls below the threshold and the interval slide rule hits the selected LR, indicated by the red dot. This plot is output by the find_appropriate_lr() function if the plot parameter is set to True.

[Image: gradientgraph]

The plot below shows the same result from the Learner’s LR plot, with the new learning rate suggestion plotted as a red dot after it is adjusted (if the adjust_value parameter is anything other than 1).

[Image: newsuggestion]

The plot below shows Fast.ai’s current learning rate finder graph obtained from the model/learner recorder object’s plot method. The red dot is Fast.ai’s built-in minimum numerical gradient value suggestion.

[Image: defaultgraph]

The current Fast.ai suggestion, the point of lowest loss gradient, generates a value of 0.12, whereas the interval slide rule method generates a smaller value, 0.0479. That is also smaller than what I would normally pick by hand. This is deliberate: because I will be automating the training of many models, I have opted to be a little conservative in my selection methodology.
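
For reference, the built-in rule compared above (the point of lowest numerical gradient of the loss) can be sketched in a few lines. This is an approximation under assumptions: min_gradient_lr is a hypothetical name, the curve is synthetic, and Fast.ai’s own implementation may trim or smooth the recorded losses before taking the minimum.

```python
import numpy as np

def min_gradient_lr(lrs, losses):
    """Return the learning rate at which the loss is falling fastest."""
    grads = np.gradient(np.asarray(losses, dtype=float))
    return lrs[int(np.argmin(grads))]

# Synthetic LR sweep (hypothetical data): a sigmoid-shaped drop in the
# middle of the sweep followed by a blow-up at the high end.
t = np.linspace(0, 1, 100)
lrs = np.logspace(-6, 1, 100)
losses = 2.0 - 1.0 / (1.0 + np.exp(-20 * (t - 0.5))) + 5 * np.exp(30 * (t - 1.0))

steepest = min_gradient_lr(lrs, losses)
print(f"min-gradient lr: {steepest:.4g}")
```

Because this rule looks only at the single steepest point, one noisy step in the recorded losses can move the suggestion, which is one reason to compare it against the interval approach.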

While the interval slide rule method works well for training and fine-tuning text models, the next step is to look into its performance when training different types of text models, as well as image and tabular models on Fast.ai. I am curious to see what pitfalls will be encountered when trying to select the best learning rate across different training environments.

Feedback, improvements, and discussion on alternative methods are welcome!

jeremyeast · April 20, 2019, 4:51pm

Amazing work, I’ve been looking for this for a while! Will test to see how it reacts to different datasets when I can.

sgugger · April 20, 2019, 6:23pm

Also test it on models that have been trained just after you unfreeze: those generally have a wildly different shape.

aychang · April 20, 2019, 11:49pm

That’s a good idea. This method generalizes by working with the huge increase in loss as the learning rate approaches 1.0+, but I’d like to see how it works on models that produce wild shapes for potential next steps and improvements.

@jeremyeast thank you! That’s great, keep us updated on how it works.

Joan · April 28, 2019, 6:43am

Hi @aychang,
Awesome tool! I am trying it on ResNet34/50 for images and it seems to work nicely.
However, I am wondering how to tune the function to get the LR values when you want to pass a slice in the LR argument.
I guess lr_to_use may be good enough for the second part of the slice, but what about the first part? It should be something just before the gradient increases. In your example it should be something like 1e-2.

Any idea how to get it? Thanks!

aychang · April 28, 2019, 11:07pm

Hi @Joan,
I’m glad it seems to be working OK on the ResNet models!

You could try increasing the lr_diff parameter as it would increase the width of the slide rule and should theoretically provide a learning rate closer to the point right before loss decreases. This is also giving me some ideas on some future improvements as well, so thanks and let me know how that works!
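
Another option for the slice question is to derive the lower bound from the suggested value by a fixed ratio. This is only a sketch: lr_slice is a hypothetical helper, and the default ratio of 10 is an assumption to tune per model, not something established in this thread.

```python
def lr_slice(lr_max, ratio=10.0):
    """Build slice(lr_max / ratio, lr_max) for discriminative learning rates."""
    if lr_max <= 0:
        # find_appropriate_lr uses -1 as a sentinel, so reject it here.
        raise ValueError("lr_max must be positive; no usable learning rate was suggested")
    return slice(lr_max / ratio, lr_max)

# 0.0479 is the value the suggester produced in the example above.
lr_range = lr_slice(0.0479)
print(lr_range.start, lr_range.stop)
```

The result could then be passed wherever a slice is accepted for the learning rate, for example as max_lr in fit_one_cycle.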

Joan · May 1, 2019, 3:50pm

Hi @aychang,

I am trying different lr_diff values and it seems that this is quite specific to each dataset, and I cannot find a way to generalize nicely. That said, 40-45 seems to be a good start, but I have to run more tests.

Regarding this, I am trying to get reproducible results using the function described here. However, when I run the code with num_workers = 0 when generating the DataBunch, I get an error:

Traceback (most recent call last):
  File "/users/genomics/jgibert/Scripts/Lymphoma_Fastai_Neptune.py", line 63, in <module>
    selected_lr = find_appropriate_lr(learn)
  File "/users/genomics/jgibert/Scripts/Lymphoma_Fastai_Neptune.py", line 40, in find_appropriate_lr
    model.lr_find()
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/train.py", line 32, in lr_find
    learn.fit(epochs, start_lr, callbacks=[cb], wd=wd)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/basic_train.py", line 196, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/basic_train.py", line 111, in fit
    finally: cb_handler.on_train_end(exception)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/callback.py", line 322, in on_train_end
    self('train_end', exception=exception)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/callback.py", line 250, in __call__
    for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/callback.py", line 240, in _call_and_update
    new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/callbacks/lr_finder.py", line 40, in on_train_end
    self.learn.load('tmp', purge=False)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/basic_train.py", line 265, in load
    state = torch.load(source, map_location=device)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/torch/serialization.py", line 368, in load
    return _load(f, map_location, pickle_module)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/torch/serialization.py", line 549, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 4355518534081521830 got 2048

I am not quite sure why this is happening. I checked some posts and it seems to be related to serialization. Any idea why this is happening?

Thanks!

Joan · May 2, 2019, 10:25am

An update on the error:

It seems that adding random_seed(42) before the DataBunch function (instead of adding it before the learner function) does not raise any error. I am quite surprised by this behavior, but this single change seems to solve the problem.
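
The random_seed function itself is not reproduced in this thread. As a rough sketch of what such a helper typically does, and of the call order that avoided the error, here is a hypothetical seed_everything stand-in. It seeds NumPy and PyTorch only if they are installed, and why the ordering matters for the error above is not established here.

```python
import random

def seed_everything(seed=42):
    """Seed the common random number generators (hypothetical stand-in for random_seed)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        # Trade some speed for repeatable cuDNN behavior.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed

# Re-seeding reproduces the same draws.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)

# Ordering that worked in the report above:
#   1. seed_everything(42)
#   2. build the DataBunch (num_workers = 0)
#   3. create the learner and call find_appropriate_lr(learn)
```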

jeremyeast · May 7, 2019, 5:50pm

Hi Andrew, quick feedback to let you know I’ve been using the AutomatedLearnRateFinder and it’s been a little bit aggressive on very imbalanced datasets, so I had to significantly increase the lr_diff parameter.

aychang · May 7, 2019, 6:53pm

@Joan
Hi Joan, thanks for the update and for bringing this up. I’ll be trying to replicate the issue, and let me know if you get any headway on this as well.

@jeremyeast
Thanks for the quick feedback. And yes, the default value may produce a high learning rate depending on how quickly the loss gradients react to the increasing learning rate. I’ve been looking into an optimized lr_diff based on various model attributes such as weight dimension size etc. Feel free to post graphs or results of the LR you’re getting with the increased lr_diff parameter. Thanks again!

LuisAnayaTan · August 17, 2019, 7:57am

You are incredible Aychang, it works perfectly, thank you very much

aychang · August 17, 2019, 3:10pm

That’s awesome to hear, if you have any feedback or questions regarding any issues please let me know

adeperio · August 23, 2019, 6:20am

Hi @aychang this looks promising!

I’ve gotten the following plots with the lr finder before. Do you think your suggester would be able to handle finding a good lr in these particular plots?

Fig1 (no clear point where loss shoots up):

Fig2 (also no clear point where loss shoots up):

Fig3 (Really variable):

aychang · August 23, 2019, 5:32pm

Hi @adeperio, I’m glad you came across this!

I believe the suggester would be able to provide a good learning rate for the learners showing all three figures.

For Fig1 and Fig2, it’s true that there are no clear points where the loss shoots up, but if you displayed the plot where the x-axis (learning rate) went up to 1e+01 or even 1e+0 the loss should still shoot up and that would also reflect in the gradient plot.

As for the learner that’s showing the Fig3 plot, a plot we’d expect from the later training/fine-tuning of the model, the suggester relies on the gradients of the losses with respect to the learning rates, so it should be reasonably robust against the erratic nature of Fig3.

These are only my thoughts however, so I’d be interested to actually see how the suggester works on these plots. Good luck and feel free to let us know how it works out when you get the chance!

adeperio · August 28, 2019, 5:01am

Hi @aychang So I’ve been using your auto LR finder for a few days now and it seems to be working pretty well! I don’t rely on it completely (just for prudence) but it is definitely a great help. I don’t use it for unfrozen learning rate finder runs (I haven’t had time to test this out much), but for frozen runs it performs pretty consistently.

I think it’s worth spending time tuning lr_diff. I’ve had to lower that value to 5 to get the results I need.

But anyway, nice work!

aychang · August 29, 2019, 1:30am

@adeperio ah that’s interesting, with such a low lr_diff I wonder if your optimal learning rate is really high or your LR finder has more of a hairpin change at the end. It’s cool to see how the finder reacts with your adjustments as well.

I also agree, good prudence is good practice. I’m glad the finder is helpful and seems to be working well for you.

Thanks!

adeperio · August 29, 2019, 1:57am

Hi @aychang

Yep I’m just finishing up some experiments and I am noticing some LR plots that have more of a hairpin style shape.

Kinda like this (using resnet34, 128px, with all drop out, wd, and augmentations off)

Yea I’m using it at the moment when I’m running my experiments (i.e. when benchmarking certain hyperparameters and setups) so that I can have a consistent LR finding procedure and so that I can fire off a bunch of experiments in one go and come back to them later.

When I have settled on a set of hyper params I then try and do manual LR finding and compare that with the auto LR.

Do you think that could be a good approach with how to use the auto LR finder?

aychang · August 29, 2019, 2:37am

@adeperio I think you’ve outlined a perfect use case for this automated lr finder/suggester.

Using the finder to streamline the process of shooting off experiments to get some empirical results from hyperparameter adjustment is a great automated way to optimize. From there, manually setting the LR for the hyperparams you’ve settled on and comparing it to the auto LR is also a great approach in my opinion.

adeperio · August 29, 2019, 2:53am

Yea I think that seems like a possible good approach moving forward. Will keep using the finder I think, it’s performing well so far and I intermittently compare it to a manual LR find once in a while for checking.

aychang · August 29, 2019, 10:35pm

Awesome thanks for the feedback, and let me know if there’s anything that comes up or I can help with.

Good luck!