I’m curious about whether learn.lr_find() should be used before or after unfreezing the layers of the network.
In the first lesson’s notebook, Jeremy runs lr_find() before unfreezing the network and then uses those learning rates to fine-tune the network.
I’ve plotted the lr_find() curves for before and after unfreezing, and they look slightly different.
I’m wondering which would be the right approach for fine-tuning?
Your two plots are evidently different because the weights in your last layer have changed.
I think lr_find() can be used in both cases.
Before unfreezing, run lr_find() and pass a fixed value of lr, because it will affect just the last layer.
After unfreezing, run lr_find() again and pass a slice of lr, because you don’t want to change the earliest layers too much.
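As a rough sketch of what passing a slice does: fastai v1 spreads slice(lo, hi) across the layer groups as geometrically even multiples between lo and hi, one rate per group. The helper below is a standalone reimplementation of that idea for illustration, not imported from the library:

```python
# Standalone sketch of how a slice(lo, hi) of learning rates can be
# spread across layer groups with geometrically even steps, mimicking
# what fastai v1's `even_mults` helper does (reimplemented here, not
# imported from the library).
def even_mults(start: float, stop: float, n: int) -> list:
    """Return n learning rates from start to stop, evenly spaced on a
    log scale (each rate is a constant multiple of the previous one)."""
    if n == 1:
        return [stop]
    mult = (stop / start) ** (1 / (n - 1))
    return [start * mult ** i for i in range(n)]

# With 3 layer groups, slice(1e-6, 1e-3) gives the earliest layers the
# smallest rate and the head the largest.
lrs = even_mults(1e-6, 1e-3, 3)
print(lrs)  # -> [1e-06, ~3.16e-05, ~1e-03]
```

The geometric spacing matches the intuition behind discriminative learning rates: each deeper group trains a constant factor faster than the one before it.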
Hope that helps
I’m sorry if I was not clear. The steps I followed were:
- Initialize a …
- Train for 5 epochs using …
- Run lr_find() and get the plot
- Run lr_find() and get the plot again

Does lr_find() change the weights of the network (cc @sgugger)? If not, then I don’t see how the weights in the last layer would have changed.
I think for fine-tuning it makes sense to unfreeze() and then run lr_find().
Ah, I was misunderstanding your process. But anyway, I think the two plots are different:
When you call lr_find() before unfreezing, only the weights in the last layer are allowed to change, so its loss behaves differently from the after-unfreeze case, where all the weights are allowed to change.
(Actually, on the first batch your two models are the same, but after some batches the weights are totally different.)
Imagine lr_find() as being similar to training your model for one epoch; the difference is that it stops when the loss increases drastically.
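That description can be sketched on a toy problem: sweep exponentially growing learning rates, take one SGD step per rate, and stop once the loss blows past a multiple of the best loss seen. This is a standalone illustration of the LR range test idea (the function name, toy loss, and divergence threshold are my own), not fastai’s implementation:

```python
# Toy sketch of the LR range test: take one training step per candidate
# learning rate, increasing the rate geometrically, and stop as soon as
# the loss explodes past a multiple of the best loss seen so far.
def lr_range_test(start_lr=1e-5, end_lr=10.0, num_it=100, diverge_mult=4.0):
    w = 0.0                      # toy parameter; true minimum is at w = 3
    mult = (end_lr / start_lr) ** (1 / (num_it - 1))
    lrs, losses, best = [], [], float("inf")
    lr = start_lr
    for _ in range(num_it):
        grad = 2 * (w - 3)       # d/dw of the toy loss (w - 3)**2
        w = w - lr * grad        # one SGD step at this learning rate
        loss = (w - 3) ** 2
        lrs.append(lr)
        losses.append(loss)
        best = min(best, loss)
        if loss > diverge_mult * best:
            break                # loss is diverging: stop the sweep early
        lr *= mult
    return lrs, losses

lrs, losses = lr_range_test()
# The sweep stops before reaching end_lr because large rates make the
# toy loss diverge.
print(f"stopped after {len(lrs)} of 100 iterations, last lr = {lrs[-1]:.3g}")
```

For this quadratic, the loss shrinks as long as the rate stays below 1 and explodes above it, so the sweep always cuts out well before end_lr.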
Even if you have the exact same network, you will probably get different curves for lr_find, since training is always random (we shuffle the data into batches).
lr_find doesn’t change the weights: it saves the model before doing anything else, then loads it back at the end.
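That save-then-restore behavior can be mimicked generically with a parameter snapshot; a minimal standalone sketch, with plain Python dicts standing in for model state:

```python
import copy

# Sketch of the save/restore pattern described above: snapshot the
# model state before the sweep, mutate it freely, then load the
# snapshot back so the weights are untouched afterwards.
params = {"w1": [0.1, -0.2], "w2": [0.5]}

snapshot = copy.deepcopy(params)       # "save the model"

# ... the LR sweep would update the weights here ...
params["w1"][0] = 999.0

params = copy.deepcopy(snapshot)       # "load it back at the end"
print(params["w1"][0])  # -> 0.1, weights unchanged after the sweep
```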
In this case, you should probably use lr_find after unfreezing, and you can even use differential learning rates while running it.
When you use lr_find, Jeremy always mentions that we need to look at the chart to find the correct slice for the next step.
Is there any way to obtain that range from the lr_find function, or from somewhere else?
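One heuristic that can automate reading the chart is to pick the rate where the loss curve falls fastest against log(lr); later fastai versions expose a similar “steepest slope” suggestion. A standalone sketch of that idea (the function name and the synthetic curve are my own):

```python
import math

# Standalone sketch of a "steepest descent" suggestion: given the lrs
# and losses recorded by an LR sweep, pick the lr where the loss curve
# drops fastest with respect to log(lr).
def suggest_lr(lrs, losses):
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        slope = (losses[i] - losses[i - 1]) / (math.log(lrs[i]) - math.log(lrs[i - 1]))
        if slope < best_slope:          # most negative slope = steepest drop
            best_slope, best_lr = slope, lrs[i]
    return best_lr

# Synthetic curve: flat, then falling fastest near 1e-3, then blowing up.
lrs = [10 ** e for e in (-5, -4, -3.5, -3, -2.5, -2, -1)]
losses = [1.0, 0.98, 0.9, 0.55, 0.35, 0.6, 3.0]
print(suggest_lr(lrs, losses))  # -> 0.001, the steepest-drop point
```

In practice you would still sanity-check the suggestion against the plot, since noisy curves can put the steepest drop in an odd place.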
You might wanna have a look at this thread: LR finder for fine-tuning
I am wondering whether the lr_find result before unfreezing can tell us the LR for the later layers, and the lr_find result after unfreezing can tell us the LR for the earlier layers.
In other words, say you run the LR finder without unfreezing, and it says things start to get worse at 1e-4. You unfreeze and run the LR finder again, and it says things get worse around 1e-5. Using Jeremy’s “pick an LR well before things get worse” rule for the earlier layers, we would pick something like slice(1e-6, 1e-4).
Just a gut feeling and not based on any experiments. What do you guys think?
So you mean to combine the results from both runs to pick the best learning rates? Sounds interesting.
We should try it out and see how it goes
It would definitely be interesting to see. Do keep us updated on any results that you find.
From what I understand, I don’t think there should be a big difference. Additionally, the graph shown by lr_find() is only for the last layer.
What is really neat is that you can pass discriminative learning rates to the LR finder.
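If I recall the fastai v1 docs correctly, start_lr and end_lr for lr_find accept slices when the model has multiple layer groups, so each group sweeps its own range while sharing one multiplier. A standalone sketch of that idea (the function name is my own, not the library API):

```python
# Standalone sketch of a discriminative LR sweep: every layer group has
# its own start/end rate, and all groups are scaled by the same
# geometric factor at each iteration, so their ratios stay fixed.
def discriminative_schedule(start_lrs, end_lrs, num_it):
    mult = (end_lrs[-1] / start_lrs[-1]) ** (1 / (num_it - 1))
    schedule = []
    for i in range(num_it):
        schedule.append([lr * mult ** i for lr in start_lrs])
    return schedule

# Three layer groups: earliest layers swept over lower rates than the head.
sched = discriminative_schedule([1e-7, 1e-6, 1e-5], [1e-2, 1e-1, 1.0], num_it=100)
print(sched[0])   # -> [1e-07, 1e-06, 1e-05]
print(sched[-1])  # last step reaches roughly [1e-2, 1e-1, 1.0]
```

Because the ratios between groups never change, the resulting plot can be read the same way as a single-rate sweep, just anchored to the head’s rate.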
I am trying to train a classifier on a bunch of images (144 images in 4 categories).
After running fit for 4 epochs, and then (before unfreezing) lr_find, it looked like a good range for the unfrozen network would be 1e-3 to 1e-2, but after trying that I get …
Next attempt: more images.
First of all, I just want to say that I’m very interested to hear how much better the results you can get with these methods are. I believe it shouldn’t make a huge difference. But I also need to ask a question about the graph above: I sometimes get the same kinds of shapes, and I think I have the same problem of a small amount of data. The question is, is it always a too-small training set, or can we get this kind of graph in other cases too?
This is my experience:
I experimented with two sets of learning rates based on lr_find() output: one with a higher learning rate (based on the run without unfreeze()) and another with a lower learning rate. I continued training with .fit_one_cycle() for the same number of epochs. Eventually, both sets of learning rates resulted in very similar error_rate.
I have the same issue.
When I just load the ImageNet weights, the lr_find() plot looks more like the one on the course site (decreasing and then jumping to infinity), but after the first training run I always get plots similar to yours (flat and then jumping to infinity).
I think that means the network is already fine-tuned and won’t easily learn more.
There was an example where the network is a multidimensional function of valleys and hills, and we want to find the lowest valley, so I used two approaches:
- train again with slightly different parameters (restart everything and find a better valley)
- train unfrozen with an lr near the left side of the plot (stay in the correct valley and go to its bottom)

I tried both, and usually it progresses a bit (1-5%).
Beyond that, I think you need to check your dataset, model size, batch size, etc.
Edit: As in lesson 3, for the second step Jeremy chooses max_lr = slice(a/10, b/5), where a is the last value on the not-yet-increasing part of the line and b is a value from the previous lr_find plot.
That seems to work better.
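That heuristic written out as a tiny helper (the function name and the example values are my own, purely illustrative):

```python
# Sketch of the slice(a/10, b/5) heuristic described above: `a` is the
# last lr on the not-yet-increasing part of the new plot, and `b` comes
# from the previous lr_find plot. The helper name is my own.
def second_stage_slice(a, b):
    return (a / 10, b / 5)   # stands in for slice(a/10, b/5)

# e.g. a = 1e-4 from the unfrozen plot, b = 1e-2 from the earlier plot
low, high = second_stage_slice(1e-4, 1e-2)
print(low, high)  # roughly 1e-05 and 0.002
```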
learn.lr_find() yields non-deterministic results, so how should we choose the range for lr? One run shows that 1e-04 gives a low error, while another says that 1e-03 is much better. Is there really a best range for lr?