Questions about the arguments to lr_find

def lr_find(self, start_lr=1e-5, end_lr=10, wds=None, linear=False, **kwargs)

In reference to lr_find from source, I’m looking for some intuition on the following:

  1. What kind of behavior (e.g., blank plots) would prompt us to modify start_lr and/or end_lr?

  2. If we are changing start_lr and/or end_lr, what guidance is there for choosing appropriate values?

  3. What does linear do?

  4. When would we want to set linear=True?

  5. When would setting wds be valuable or necessary for getting a good result?

If there are notebooks and/or wiki entries that discuss the above please feel free to just point me in that direction. I’ve been looking through my notes and haven’t really found any satisfying answers to these questions.

  1. Move them closer together for a finer-grained search (this matters most with linear=True). Move them further apart if you want to try more extreme LRs.
  2. Look at past papers, or just experiment.
  3. It adds a fixed amount to the LR each batch, rather than multiplying the LR by a fixed ratio (try plot_lr to see what I mean).
  4. When you want to find the exact point where the loss starts getting worse, i.e. if you’re trying to really optimize your LR.
  5. It’s probably always a good idea to set weight decay (wds) to whatever you’ll fit with, since it will affect the LR finder curve.
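
To make answer 3 concrete, here is a minimal sketch in plain Python of the two ways the LR can grow from start_lr to end_lr over a sweep. This is not the library's actual code: lr_schedule and num_batches are hypothetical names made up for illustration, and the sketch assumes the sweep hits start_lr on the first batch and end_lr on the last.

```python
def lr_schedule(start_lr, end_lr, num_batches, linear=False):
    """Return the LR for each batch of a hypothetical LR sweep.

    Illustration only (an assumption, not the library's implementation).
    """
    if num_batches < 2:
        raise ValueError("need at least two batches to sweep a range")
    if linear:
        # Linear: add the same fixed amount to the LR every batch.
        step = (end_lr - start_lr) / (num_batches - 1)
        return [start_lr + i * step for i in range(num_batches)]
    # Default: multiply the LR by the same fixed ratio every batch.
    ratio = (end_lr / start_lr) ** (1 / (num_batches - 1))
    return [start_lr * ratio ** i for i in range(num_batches)]


# Sweep the default range from the signature above in 7 batches.
exp_lrs = lr_schedule(1e-5, 10, num_batches=7)
lin_lrs = lr_schedule(1e-5, 10, num_batches=7, linear=True)

for i, (e, l) in enumerate(zip(exp_lrs, lin_lrs)):
    print(f"batch {i}: multiplicative={e:.5g}  linear={l:.5g}")
```

With a wide range like the default, the multiplicative sweep spends most of its batches on small LRs, while the linear sweep jumps to large LRs almost immediately. That is why linear is mainly useful once start_lr and end_lr are already close together.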

Any specific papers in particular?

And thanks!

Any past papers that looked at datasets and/or architectures similar to yours. They’ll tell you what LR they used. That said, it’s not at all common to need to go outside the finder’s default LR range.