Question on "easy steps to train world-class image classifier":

In the lesson 1 notebook, the steps are:

  1. Enable data augmentation, and precompute=True
  2. Use lr_find() to find highest learning rate where loss is still clearly improving
  3. Train last layer from precomputed activations for 1-2 epochs
  4. Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
  5. Unfreeze all layers
  6. Set earlier layers to 3x-10x lower learning rate than next higher layer
  7. Use lr_find() again
  8. Train full network with cycle_mult=2 until over-fitting
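
As a rough illustration of step 6, the per-group learning rates can be built from the rate of the last layer group and a fixed factor. This is a minimal NumPy sketch, not code from the notebook: the helper name `differential_lrs` and the default of three layer groups are assumptions made for the example.

```python
import numpy as np

def differential_lrs(top_lr, n_groups=3, factor=10.0):
    """Hypothetical helper: one learning rate per layer group, earliest group first.

    Each group gets a rate `factor` times lower than the next higher group.
    """
    exponents = np.arange(n_groups - 1, -1, -1)  # e.g. [2, 1, 0] for three groups
    return top_lr / factor ** exponents

lrs = differential_lrs(1e-2)                     # 10x between neighbouring groups
lrs_gentle = differential_lrs(1e-2, factor=3.0)  # 3x between neighbouring groups
```

With the defaults this gives the same three values used later in the thread, 1e-4, 1e-3 and 1e-2.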

But shouldn’t step #8 go before step #7 so that the differential learning rates can be set based on what we learn from it?


You definitely need to do lr_find before you train, otherwise there’s no point running lr_find! I suspect I’m not really understanding your question…

Oops, I mean … shouldn’t step #7 then go before step #6?

So after you unfreeze, you run lr_find, and based on that you set the differential learning rates, and then lastly, you train (step #8).

You need to set the multiples in the LRs first, so that lr_find knows about them. You can scale them afterwards.

So to do that, we have to create a new learner, correct?

I guess I’m missing how we make the learner aware of the new LRs … I don’t see where the learning rate can be set on an existing learner instance.

No you don’t have to create a new learner. The LRs are a param to fit(). You can also pass them as a param to lr_find(). Now I think about it, we didn’t actually cover this on Monday! I’ll mention it in the next class. Basically, do something like this:

lrs = np.array([1e-4,1e-3,1e-2])

Ah ok … yah that was what I was missing.

I saw that lr_find() takes a start_lr and an end_lr, but I wasn’t sure how to use them.


If you’re feeling generous, a pull request with docs for those params would be nice! Especially to point out that you can pass differential learning rates to lr_find, and it’ll keep the multiples between the groups constant (but the plot will show the LR in the last layer group). No obligation though - only if you feel like it!
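
To make "keep the multiples between the groups constant" concrete, here is a small NumPy sketch of a sweep in which every layer group is scaled by the same factor at each step. It is not the library's implementation: the function `lr_sweep`, the `n_steps` parameter and the geometric schedule are assumptions for illustration only.

```python
import numpy as np

def lr_sweep(start_lrs, end_lr, n_steps):
    """Hypothetical sweep: scale all groups by one shared factor per step.

    The last group grows geometrically from start_lrs[-1] to end_lr;
    earlier groups keep their original ratio to it. Requires n_steps >= 2.
    """
    start_lrs = np.asarray(start_lrs, dtype=float)
    ratio = end_lr / start_lrs[-1]
    # One shared multiplier per step, running from 1 up to `ratio`.
    mults = ratio ** (np.arange(n_steps) / (n_steps - 1))
    return mults[:, None] * start_lrs[None, :]   # shape (n_steps, n_groups)

lrs = np.array([1e-4, 1e-3, 1e-2])
schedule = lr_sweep(lrs / 1000, end_lr=10, n_steps=100)
plotted = schedule[:, -1]   # the last group's rate, i.e. what the plot's axis would show
```

Each row of `schedule` holds one rate per layer group, and the first two columns stay at 1/100 and 1/10 of the last column throughout the sweep.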

Absolutely feeling generous. Expect it by tomorrow.


I did this:

lrs = np.array([1e-4,1e-3,1e-2])

… but when I tried to plot it via learn.sched.plot(), the plot looks way off (I can’t screenshot it, but it’s a diagonal line going up from left to right).

Am I missing something in your explanation?

Here is what I see:

You need lrs/1000 as I showed in my earlier reply. Otherwise you’re starting with too high learning rates.

So if I do so and find that the optimal lr is now 1e-5, for example, does that mean we should update our differential learning rates as:

lrs = [1e-7, 1e-6, 1e-5]

Yes - or just say lrs = lrs/1000, for instance.
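
The rescaling can also be written so that the last group lands on whatever value was read off the plot, with the earlier groups following in proportion. A minimal sketch, where `rescale_to` is a hypothetical helper name and not part of any library:

```python
import numpy as np

def rescale_to(lrs, new_top_lr):
    """Hypothetical helper: scale all rates so the last one equals new_top_lr."""
    lrs = np.asarray(lrs, dtype=float)
    return lrs * (new_top_lr / lrs[-1])

lrs = np.array([1e-4, 1e-3, 1e-2])
new_lrs = rescale_to(lrs, 1e-5)   # same ratios, last group now at 1e-5
```

For these numbers the result matches dividing the whole array by 1000.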

To be more clear, I’m doing this:

lr = 1e-2
... train
lrs = [1e-4, 1e-3, 1e-2]


# looking at learn.sched.plot(), the optimal learning rate is 1e-5,
# so before more training I update my "lrs" variable accordingly ...

lrs = [1e-7, 1e-6, 1e-5]
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)

Is that a correct understanding of using and applying what we learn from lr_find()?


That’s right, although your lrs needs to be wrapped in np.array.
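
One concrete reason the wrapping matters: a plain Python list does not support element-wise arithmetic such as lrs/1000, while a NumPy array does. A short check, using only NumPy (the variable names are just for this example):

```python
import numpy as np

lrs_list = [1e-4, 1e-3, 1e-2]

try:
    lrs_list / 1000            # plain lists do not define division
    list_divides = True
except TypeError:
    list_divides = False

lrs = np.array(lrs_list)
scaled = lrs / 1000            # element-wise division on the array
```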


I’m doing lrs/1000 and I am getting weird plots for the learning rate. Any idea what might be wrong?

Looks like you’re running lr_find on a network that’s already trained a bit. In which case your chart isn’t that odd.

Right. But given that this is Step 7 (of 8) in “easy steps to train a world-class image classifier”, the network would naturally already be pretty trained. Indeed, prior to this I’ve trained the last layers both with and without data augmentation by the time I run this.

If this is indeed the expected graph, how would you interpret the second plot? Would the appropriate learning rate for the last layers of the network be maybe somewhere between 1e-3 and 1e-2?

I don’t have a great rule of thumb for the 2nd graph. Generally I just keep the learning rate at the same level, frankly.
