No, you don’t have to create a new learner. The LRs are a param to fit(). You can also pass them as a param to lr_find(). Now that I think about it, we didn’t actually cover this on Monday! I’ll mention it in the next class. Basically, do something like this:
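Something along these lines (a minimal sketch, assuming the fastai 0.7-era API; learn is your existing learner, and the LR values are just placeholders):

import numpy as np

lrs = np.array([1e-4, 1e-3, 1e-2])  # one LR per layer group: early, middle, last

learn.fit(lrs, 1)        # fit() takes the differential LRs as its first param
learn.lr_find(lrs/1000)  # lr_find() accepts them as its starting LRs too
learn.sched.plot()       # the plot's x-axis shows the last layer group's LR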
If you’re feeling generous, a pull request with docs for those params would be nice! Especially to point out that you can pass differential learning rates to lr_find, and it’ll keep the multiples between the groups constant (but the plot will show the LR in the last layer group). No obligation though - only if you feel like it!
import numpy as np

lr = 1e-2                           # single LR (unused once the differential lrs below are set)
lrs = np.array([1e-4, 1e-3, 1e-2])  # differential LRs: one per layer group
# Looking at learn.sched.plot(), the optimal learning rate is 1e-5,
# so before more training I update my "lrs" variable accordingly ...
lrs = np.array([1e-7, 1e-6, 1e-5])
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
Is that a correct understanding of how to apply what we learn from lr_find()?
Right. But given that this is Step 7 (of 8) in “easy steps to train a world-class image classifier”, the network would naturally already be pretty trained. Indeed, by the time I run this I’ve already trained the last layers both with and without data augmentation.
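Roughly, that earlier training looked something like this (a sketch of the lesson workflow; the exact values are placeholders):

learn.fit(1e-2, 1)               # last layers on precomputed activations (no augmentation)
learn.precompute = False
learn.fit(1e-2, 3, cycle_len=1)  # last layers again, now with data augmentation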
If this is indeed the expected graph, how would you interpret the second plot? Would the appropriate learning rate for the last layers of the network be somewhere between 1e-3 and 1e-2?
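Concretely, if that reading is right, I imagine I’d set something like this (the values are just my guess from the plot):

lrs = np.array([3e-5, 3e-4, 3e-3])  # last group between 1e-3 and 1e-2, earlier groups 10x smaller each
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)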