I think it is easier to think about this in terms of epochs. Epochs are also the unit of the cycle parameters passed to learner.fit.
What I mean by that is if you pass cycle_mult=1… well, nothing will happen. After a cycle ends, its length will be multiplied by… 1, so every cycle stays the same length.
If you start with a cycle_len of 1 (that is, a cycle lasting one entire epoch) and a cycle_mult of 2, then after the first epoch (one full cycle) the length will be multiplied by 2, so the next cycle will last 2 epochs. After that (we will have finished epoch 3), we multiply by 2 again and get a cycle length of 4. So the next cycle will take us all the way through epoch 7. And so on.
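If it helps, here is a small sketch of that arithmetic in plain Python. It does not call fastai at all: cycle_schedule is a made-up helper name, and I am assuming the cycle length is simply multiplied by cycle_mult after each cycle, as described above.

```python
def cycle_schedule(n_cycles, cycle_len=1, cycle_mult=2):
    """Hypothetical helper (not part of fastai).

    Returns one (cycle_length, epochs_finished) tuple per cycle, assuming
    the length of a cycle is multiplied by cycle_mult once it ends.
    """
    schedule = []
    epochs_finished = 0
    length = cycle_len
    for _ in range(n_cycles):
        epochs_finished += length          # this cycle runs for `length` epochs
        schedule.append((length, epochs_finished))
        length *= cycle_mult               # the next cycle is cycle_mult times longer
    return schedule


print(cycle_schedule(3))                   # [(1, 1), (2, 3), (4, 7)]
print(cycle_schedule(3, cycle_mult=1))     # [(1, 1), (1, 2), (1, 3)]
```

The first line matches the example: cycles of 1, 2 and 4 epochs, ending after epochs 1, 3 and 7. With cycle_mult=1 the cycles never grow.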
I wouldn’t really worry too much about getting this perfectly right. I don’t think anyone knows what ‘perfect’ parameters in this context might be and one of the great benefits of cosine annealing is that it lets us get away with nearly anything we throw at it.
Also, the defaults that Jeremy shares in lecture are a great starting point. In my experience, training for 2 or 3 cycles with a cycle multiplier of 2 works well, and it is something I would use for a lot of scenarios.