Learning rate finder: how do we know that the sample on which the "simulation" is done is representative of the rest of the dataset?


I may have missed something fundamental, but there’s something I don’t get about the learning rate finder method. On what sample size of the data are the iterations run to see how the loss evolves as a function of the learning rate? How do we know that this sample is representative, and that the learning rates will “behave” similarly on the rest of the data?

Thanks a lot!


It doesn’t use a sample, it uses the whole dataset.

Thanks a lot for the fast answer!
But isn’t each iteration done on a mini-batch, which is a sample/subset of the data? How do we pick its size? The experiment to study how the loss evolves with the learning rate assumes a certain “consistency” in behavior across batches/iterations. Doesn’t this require that each batch be a representative sample of the entire dataset?
Put in an extreme way: at one end, the batch size could be so small that it cannot “correctly guide” the “descent”, and the direction/results would change from batch to batch and iteration to iteration. At the other extreme, the batch size could be the size of the entire dataset, in which case there is no gain from this method: it would be as if we were training multiple times, varying the learning rate each time. Thanks again!
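For reference, the finder typically increases the learning rate a little after every mini-batch on an exponential schedule, so each learning rate is only ever “tested” on one batch. A minimal sketch of such a schedule (the start/end values and step count below are illustrative, not fastai’s defaults):

```python
def lr_schedule(start_lr, end_lr, num_iters):
    """Exponentially increase the learning rate from start_lr to end_lr,
    one step per mini-batch, as in a learning rate range test."""
    ratio = (end_lr / start_lr) ** (1 / (num_iters - 1))
    return [start_lr * ratio ** i for i in range(num_iters)]

# one learning rate per mini-batch over 100 iterations
lrs = lr_schedule(1e-5, 10, num_iters=100)
```

Because each step sees a different mini-batch, the per-batch loss at a given learning rate is indeed a noisy estimate, which is exactly why the plotted curve needs smoothing.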

The batch size is whatever batch size you chose when setting up the ModelData object (defaults to 64). The code uses an exponential moving average to smooth the loss.

If the batch size is too small, it will be obvious in the chart, since the loss will move around a lot. If that happens, increase the batch size.
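A quick synthetic demonstration of why small batches give a jumpy curve: the spread of a mini-batch’s mean loss shrinks roughly as 1/sqrt(batch size). The per-example losses below are made-up data, purely for illustration:

```python
import random
import statistics

random.seed(0)
# synthetic per-example losses (illustrative, not from a real model)
losses = [random.gauss(1.0, 0.5) for _ in range(10_000)]

def batch_loss_std(batch_size, n_batches=500):
    """Standard deviation of the mean loss across random mini-batches:
    smaller batches give noisier loss estimates."""
    means = [statistics.mean(random.sample(losses, batch_size))
             for _ in range(n_batches)]
    return statistics.stdev(means)

# e.g. batch_loss_std(8) is noticeably larger than batch_loss_std(64)
```

This is the same effect seen in the finder’s chart: with tiny batches the loss estimate at each learning rate is dominated by sampling noise.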

For those who want to have a look at how the loss evolves as a function of the learning rate for different batch sizes, I’ve found this valuable post from one of the fast.ai students:

For small batch sizes, the loss indeed moves around a lot, and it is more “stable” for larger batches.
What is interesting is that even for batch sizes where the loss does not move around much, the learning rate we would infer (“a bit before the loss starts increasing”) differs from one batch size to another (compare the plots for BS=16, 32, and 64).