I have checked again, and the difference is actually not as big as I wrote initially.
2e-1 (num_it = 100)
5e-2 (num_it = 1000)
1e-2 (num_it = 10000)
So the first one is only 20 times higher than the last one (not 100).
But the difference is still rather big, isn’t it?
It is rather expected, actually. To the best of my knowledge, according to the original paper by Leslie Smith and the implementation in fastai, once you have trained for an iteration, you don’t “reset” your model.
So num_it=100 means that any LR being tested is applied to a model that has already been trained for at most 99 iterations. If you go for num_it=1e4, that is about a hundred times more.
Now back to your specific observation, keeping the above in mind:
Let’s take a given LR named lr_star.
If num_it=100, when the loop reaches lr_star the model will have been trained for N iterations (optimizing towards the global minimum, or at least a local minimum). Let’s say the loss we find at this point is l1.
If num_it=1e4, when the loop reaches lr_star the model will have been trained towards the same objective, but for about 100N iterations. Those 100N iterations cover the same LR range as before, […, lr_star], so it is roughly the same LR schedule with every step repeated a hundred times. The previous loss l1 has therefore most likely been reached (and passed) somewhere within those 100N iterations, at an LR lower than lr_star.
In short, for this specific method, raising the number of iterations shifts the returned learning rate to the left (lower). In my opinion, that’s why the default value of 100 is well picked.
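To put rough numbers on the N vs. 100N argument, here is a minimal sketch that counts how many iterations run before the sweep reaches a given lr_star. It assumes an exponential sweep from start_lr=1e-7 to end_lr=10 (what I believe the fastai defaults to be, so treat those values as assumptions); lr_at and steps_before are hypothetical helper names, not fastai functions.

```python
def lr_at(i, num_it, start_lr=1e-7, end_lr=10.0):
    """LR used at iteration i (0-based) of an exponential sweep over num_it steps."""
    return start_lr * (end_lr / start_lr) ** (i / (num_it - 1))


def steps_before(lr_star, num_it, start_lr=1e-7, end_lr=10.0):
    """Number of iterations trained with an LR strictly below lr_star."""
    return sum(
        1 for i in range(num_it)
        if lr_at(i, num_it, start_lr, end_lr) < lr_star
    )


# Compare how much training has happened before lr_star is reached.
lr_star = 1e-2
for num_it in (100, 10_000):
    print(num_it, steps_before(lr_star, num_it))
```

With these assumed bounds, the model has seen about a hundred times more updates by the time lr_star is reached when num_it=1e4, all of them at learning rates below lr_star.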
Now a suggestion: if you want to go the extra mile, you can study the resetting part I mentioned:
Define the LR range and the resolution R you want (this time limited only by your computation capacity)
Subdivide this LR range (in log scale) into R values
For each LR value, train the model on M random samples of the training set (M plays the role of num_it here, so 100 would be good), evaluate the trained model on the validation/test set, then reset it before moving on to the next LR value
Repeat until you cover the entire range (or loss explodes)
Please note that this is rather experimental and far more computationally intensive (and only an untested suggestion of mine), but you’ll have control over the LR resolution, and you’ll see the impact on the validation loss rather than the training loss.
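The steps above could be sketched as follows. This is only a toy illustration in plain Python, not fastai code: the 1-D linear model, the helper names (make_data, mse, train, lr_sweep_with_reset), and the "loss explodes" threshold of 4x the initial validation loss are all assumptions made for the example.

```python
import math
import random


def make_data(n, rng):
    # Toy 1-D regression: y = 3x + 1 + small noise (purely illustrative).
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3 * x + 1 + rng.gauss(0, 0.1)) for x in xs]


def mse(params, data):
    w, b = params
    total = 0.0
    for x, y in data:
        err = w * x + b - y
        total += err * err  # err * err gives inf on overflow instead of raising
    return total / len(data)


def train(params, samples, lr):
    # Plain SGD, one sample per step, constant learning rate.
    w, b = params
    for x, y in samples:
        err = w * x + b - y
        w -= lr * 2 * err * x
        b -= lr * 2 * err
    return w, b


def lr_sweep_with_reset(train_set, valid_set, lr_min, lr_max, R, M, seed=0):
    rng = random.Random(seed)
    init = (0.0, 0.0)  # every LR value starts from these same weights
    init_loss = mse(init, valid_set)
    results = []
    for k in range(R):
        # R values evenly spaced on a log scale between lr_min and lr_max.
        lr = lr_min * (lr_max / lr_min) ** (k / (R - 1))
        samples = [rng.choice(train_set) for _ in range(M)]
        params = train(init, samples, lr)  # starting from init is the "reset"
        loss = mse(params, valid_set)      # validation loss, not training loss
        if not math.isfinite(loss) or loss > 4 * init_loss:
            break  # loss exploded: stop the sweep
        results.append((lr, loss))
    return results


rng = random.Random(42)
train_set, valid_set = make_data(500, rng), make_data(200, rng)
results = lr_sweep_with_reset(train_set, valid_set, 1e-4, 10.0, R=11, M=100)
best_lr, best_loss = min(results, key=lambda r: r[1])
print(best_lr, best_loss)
```

Because every LR value starts from the same initial weights, the loss recorded for one LR no longer depends on how long the model was trained at lower LRs. With a real network you would restore a saved copy of the initial weights (and optimizer state) instead of reusing a tuple.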