lr_find gives very different results depending on the value of the num_it argument:
if I run it with num_it == 100 (the default), the minimum is at ~1e0, so the suggested lr would be ~1e-1
if I run it with num_it == 1000, the minimum is at ~1e-1, so the suggested lr would be ~1e-2
if I run it with num_it == 10000, the minimum is at ~1e-2, so the suggested lr would be ~1e-3
I'm wondering what the reason for this behavior is (I would expect at least the same order of magnitude), and also how to choose a good lr.
Any help would be really appreciated!
Have you fixed every kind of random seed to make the runs reproducible?
I have checked again; actually, the difference is not as big as I initially wrote:
2e-1 (num_it = 100)
5e-2 (num_it = 1000)
1e-2 (num_it = 10000)
So the first one is only 20 times higher than the last one (not 100),
but the difference is still rather big, isn’t it?
I’m working now on another task and again I see this:
num_it == 100 -> ~1e0
num_it == 1000 -> ~1e-1
num_it == 10000 -> ~1e-2
It is rather expected, actually. To the best of my knowledge, according to the original paper by Leslie Smith and the implementation in fastai, once you have trained for an iteration, you don’t “reset” your model.
So with num_it=100, any LR being tested is applied to a model that has been trained for at most 99 iterations. But if you go for num_it=1e4, it’s a hundred times more.
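In other words, the finder keeps training the same model while exponentially increasing the LR. Here is a minimal sketch of that loop in plain PyTorch (the name `lr_find_sketch` and its signature are my own illustration, not fastai’s actual implementation):

```python
import torch


def lr_find_sketch(model, loss_fn, data_iter,
                   start_lr=1e-7, end_lr=10.0, num_it=100):
    """Train one batch per step while growing the LR exponentially.

    The model is never reset between steps, so later (higher) LRs are
    applied to an already partially trained model -- which is why a
    larger num_it tends to shift the loss minimum to lower LRs.
    """
    # Multiplier so that start_lr * mult**(num_it - 1) == end_lr
    mult = (end_lr / start_lr) ** (1.0 / (num_it - 1))
    opt = torch.optim.SGD(model.parameters(), lr=start_lr)
    lrs, losses = [], []
    lr = start_lr
    for _ in range(num_it):
        xb, yb = next(data_iter)
        for group in opt.param_groups:   # set the current LR
            group["lr"] = lr
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()                        # note: no model reset here
        lrs.append(lr)
        losses.append(loss.item())
        lr *= mult
    return lrs, losses
```

The returned (lrs, losses) pairs are what the lr_find plot is drawn from; the suggested LR is typically picked somewhat below the loss minimum.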
Now back to your specific observation, keeping the above in mind:
- Take a given LR, call it lr_star.
- If num_it=100, by the time the loop reaches lr_star, the model will have been trained for N iterations (optimizing towards the global minimum, or at least a local one). Say the loss we find at this point is l1.
- If num_it=1e4, when the loop reaches lr_star, the model will have been trained towards the same objective, but for 100N iterations. Since those 100N iterations span the same LR range as before […, lr_star], they reproduce roughly the same training schedule, repeated a hundred times. So the previous loss l1 has most likely already been reached (and passed) within those 100N iterations.
In short, for this specific method, raising the number of iterations will shift the returned learning rate lower (to the left on the plot). In my opinion, that’s why the default value of 100 is well chosen.
Now, a suggestion if you want to go the extra mile: you can experiment with the resetting part I mentioned:
- Define the LR range and the resolution R you want (this time, limited only by your computation capacity)
- Subdivide this LR range (in log scale) into R values
- For each value of LR, train the model on M random samples of the training set (M = num_it previously, so 100 would be good), then evaluate the trained model on the validation/test set before resetting it for the next value of LR
- Repeat until you cover the entire range (or the loss explodes)
Please note this is rather experimental and way more computationally intensive (and only an untested suggestion of mine), but you’ll have control over the LR resolution, and you’ll measure the impact on the validation loss rather than the training loss.
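As a rough sketch of that resetting procedure in plain PyTorch (all names here, like `reset_lr_sweep` and `make_model`, are hypothetical illustrations of the idea above, not fastai code):

```python
import math
import torch


def reset_lr_sweep(make_model, loss_fn, train_xy, valid_xy,
                   start_lr=1e-6, end_lr=1.0, resolution=10, m=100, bs=32):
    """For each of `resolution` log-spaced LRs, train a FRESH model
    on m random mini-batches, then record its validation loss."""
    Xtr, ytr = train_xy
    Xva, yva = valid_xy
    lrs = torch.logspace(math.log10(start_lr), math.log10(end_lr), resolution)
    results = []
    for lr in lrs.tolist():
        model = make_model()              # "reset": fresh model per LR
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(m):                # m random training mini-batches
            idx = torch.randint(0, len(Xtr), (bs,))
            opt.zero_grad()
            loss = loss_fn(model(Xtr[idx]), ytr[idx])
            loss.backward()
            opt.step()
        with torch.no_grad():             # impact on the VALIDATION loss
            val_loss = loss_fn(model(Xva), yva).item()
        results.append((lr, val_loss))
        if not math.isfinite(val_loss):   # stop once the loss explodes
            break
    return results
```

Each entry of the result is independent of the others, so the curve no longer depends on how long the model was trained at the previous LRs, at the cost of resolution × m training iterations in total.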
I hope this helped, cheers!
You have to seed a few more RNGs as well. Try the following function to do it:

import random
import numpy as np
import torch

def setReproducibility(seed_value, use_cuda):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    if use_cuda:
        torch.cuda.manual_seed_all(seed_value)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
Also, in case the function call alone is not sufficient, you have to use only one worker, since parallel data loading is not deterministic either.
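For example, assuming PyTorch’s DataLoader, keeping loading in the main process and seeding the shuffle generator would look like this (a sketch; the `generator` argument is available in recent PyTorch versions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

SEED = 42
torch.manual_seed(SEED)

# Toy dataset just for illustration
ds = TensorDataset(torch.randn(100, 4), torch.randn(100))

# num_workers=0 keeps loading in the main process, avoiding the
# non-deterministic batch ordering that parallel workers can introduce;
# the seeded generator makes the shuffle order itself reproducible.
dl = DataLoader(ds, batch_size=16, shuffle=True, num_workers=0,
                generator=torch.Generator().manual_seed(SEED))
```

With this setup, two runs with the same seed should iterate the dataset in the same order.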
Thank you @fgfm!
It’s much clearer now.