Are there any rules of thumb for choosing cycle_mult when using SGDR? In particular, if I have 4000 mini-batches, should I restart every 400 batches?
I think it is easier to think about this in terms of epochs. Epochs are also the unit of the parameters passed to learner.fit. What I mean by that is if you pass cycle_mult=1… well, nothing will happen. After a cycle ends, its length will be multiplied by… 1.
If you start with a cycle_len of 1 (that is, a cycle lasting one entire epoch) and a cycle_mult of 2, then after the first epoch (one full cycle), the length will be multiplied by 2, so the next cycle will last 2 epochs. After that (we will have finished epoch 3), we multiply by 2 again, and we get a cycle length of 4. So the next cycle will take us all the way through epoch 7. And so on.
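The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the description in this post, not fastai code: cycle_lengths is a hypothetical helper, and it assumes each cycle is simply cycle_mult times longer than the previous one.

```python
def cycle_lengths(n_cycles, cycle_len=1, cycle_mult=1):
    """Hypothetical helper: length in epochs of each cycle,
    assuming every cycle is cycle_mult times longer than the previous one."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]


lengths = cycle_lengths(3, cycle_len=1, cycle_mult=2)
print(lengths)       # [1, 2, 4]
print(sum(lengths))  # 7 epochs in total
```

With cycle_mult=1 the same helper returns a list of equal lengths, which is the "nothing will happen" case.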
I wouldn’t really worry too much about getting this perfectly right. I don’t think anyone knows what ‘perfect’ parameters in this context might be and one of the great benefits of cosine annealing is that it lets us get away with nearly anything we throw at it.
Also, the defaults that Jeremy shares in lecture are a great starting point. In my experience, training for 2 or 3 cycles with a cycle multiplier of 2 works well, and it is something I would use for a lot of scenarios.
So what would be the difference between using
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
vs
learn.fit(lr, 7, cycle_len=1)?
Because unless I am missing something, both result in the same number of epochs (7) and hence the same amount of training. Or am I missing something behind the scenes, or in the reasoning behind the need for a cycle multiplier?
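One way to look at this question is to write out the learning rate schedule each call would imply. The sketch below is a rough illustration, not fastai's implementation: sgdr_schedule is a hypothetical helper, it assumes plain cosine annealing from the peak rate towards zero within each cycle, and batches_per_epoch=4 is an arbitrary small number chosen to keep the output readable.

```python
import math


def sgdr_schedule(lr, n_cycles, cycle_len=1, cycle_mult=1, batches_per_epoch=4):
    """Hypothetical helper: per-batch learning rates for cosine annealing
    with restarts, plus the batch indices at which each cycle starts."""
    rates, restarts = [], []
    for i in range(n_cycles):
        # Assumption: cycle i lasts cycle_len * cycle_mult**i epochs.
        n_batches = cycle_len * cycle_mult ** i * batches_per_epoch
        restarts.append(len(rates))
        for b in range(n_batches):
            # Cosine annealing: start at lr, decay towards 0 over the cycle.
            rates.append(lr / 2 * (1 + math.cos(math.pi * b / n_batches)))
    return rates, restarts


rates_mult, restarts_mult = sgdr_schedule(0.01, 3, cycle_len=1, cycle_mult=2)
rates_flat, restarts_flat = sgdr_schedule(0.01, 7, cycle_len=1)

print(len(rates_mult), len(rates_flat))  # 28 28 (same amount of training)
print(restarts_mult)  # [0, 4, 12]
print(restarts_flat)  # [0, 4, 8, 12, 16, 20, 24]
```

Under these assumptions both calls train for the same number of batches, but the schedules differ: with cycle_mult=2 the learning rate is reset 3 times and each annealing phase is longer than the last, while without it the rate is reset at the start of every one of the 7 epochs.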