Best practice for further training after running a 1cycle policy?

PPPW · December 24, 2018, 3:09am

After training with a 1cycle policy, what is the best practice if we want to train further?

(1) Run a 1cycle policy again with some learning rate (perhaps the same lr_max?). This would be similar to SGDR from the lecture, with many cosine-annealing cycles at the same lr_max.

(2) Train with very small learning rates.

(3) Rerun the 1cycle policy from the beginning but with more epochs (so it’s really "1" cycle). Of course, this probably means ditching the previous 1cycle results.

I have done some experiments but haven’t got consistent results so I’d like to get some suggestions. Any help will be really appreciated!
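To make the three options concrete, here is a rough sketch of the learning-rate schedule each one produces. It is plain Python, not the fastai implementation: the function name `one_cycle_lr`, the cosine warm-up and decay, the defaults (`div_factor=25`, `pct_start=0.3`, `final_div=1e4`), and the step counts and learning rates are all assumptions chosen for illustration.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, div_factor=25.0, pct_start=0.3, final_div=1e4):
    """Learning rate at 0-based `step` of a one-cycle schedule (illustrative sketch)."""
    start_lr = max_lr / div_factor          # assumed starting lr
    end_lr = start_lr / final_div           # assumed final lr
    warm_steps = max(1, round(total_steps * pct_start))
    if step < warm_steps:
        # cosine ramp from start_lr up to max_lr
        frac = step / warm_steps
        return start_lr + (max_lr - start_lr) * (1 - math.cos(math.pi * frac)) / 2
    # cosine decay from max_lr down to end_lr
    frac = (step - warm_steps) / max(1, total_steps - warm_steps - 1)
    return end_lr + (max_lr - end_lr) * (1 + math.cos(math.pi * frac)) / 2

def schedule(total_steps, max_lr, **kwargs):
    return [one_cycle_lr(s, total_steps, max_lr, **kwargs) for s in range(total_steps)]

# The run that has already been done (100 steps, lr_max = 1e-2; both hypothetical).
first_run = schedule(100, 1e-2)

# Option 1: a second cycle with the same lr_max, continuing from the trained weights.
option1 = first_run + schedule(100, 1e-2)

# Option 2: continue with a very small constant learning rate.
option2 = first_run + [1e-5] * 100

# Option 3: discard the first run and train from scratch with one longer cycle.
option3 = schedule(200, 1e-2)
```

Plotting the three lists side by side shows the difference: option 1 has two peaks, option 2 has one peak followed by a flat tail, and option 3 has a single, wider peak.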

digitalspecialists · December 24, 2018, 3:16pm

Mostly 3. I usually go through a period of finding the cycle length that gives the best bang for the buck. This isn’t always straightforward: 1cycle works so well across different lengths that there is no ‘cliff’. If I try a length that works, I’ll then try 70%, 200%, and 300% of that length and see what happens. It can be quite an endeavour to find the ‘best’ max_lr, divpct, pctstart, diff-lr’s, cycle length, wd, and dropout when 1cycle already does a good job of regularising, but perhaps that’s the very point.
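The length sweep described above could be organised roughly like this. It is only a sketch: `candidate_cycle_lengths` and `sweep` are hypothetical helpers, and `train_from_scratch` stands in for whatever function retrains the model with a single cycle of the given length and returns a validation metric.

```python
def candidate_cycle_lengths(base_epochs, factors=(0.7, 2.0, 3.0)):
    """Cycle lengths at 70%, 200% and 300% of a length that already works."""
    return sorted({max(1, round(base_epochs * f)) for f in factors})

def sweep(train_from_scratch, base_epochs):
    """Retrain once per candidate length and collect the results.

    `train_from_scratch(n_epochs)` is a hypothetical callable supplied by the
    user; it should run one full cycle of `n_epochs` and return a validation loss.
    """
    return {n: train_from_scratch(n) for n in candidate_cycle_lengths(base_epochs)}
```

With a base length of 10 epochs, the candidates are 7, 20, and 30 epochs, each trained from scratch as in option (3).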

Sometimes 2/1. If a run (generally a long-running CV fold that I don’t have time to re-run) doesn’t perform as well as the others, I’ll do another short run (<10 epochs) where the max learning rate is the previous run’s min, and the div pct is narrow, e.g. 2, i.e. the run starts at half of its max. It rarely works.
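As a rough illustration of that follow-up run, the numbers could be derived as below. This is a sketch under assumptions: `follow_up_cycle` is a hypothetical helper, and "previous min" is read here as the previous run's starting learning rate (`prev_max_lr / prev_div_factor`); if the final annealed learning rate was meant instead, the values would be much smaller.

```python
def follow_up_cycle(prev_max_lr, prev_div_factor=25.0, new_div_factor=2.0):
    """Learning-rate settings for a short second run after a one-cycle run.

    Assumption: the new max lr equals the previous run's starting lr,
    and the new run starts at new_max_lr / new_div_factor (half, for 2).
    """
    new_max_lr = prev_max_lr / prev_div_factor
    return {
        "max_lr": new_max_lr,
        "start_lr": new_max_lr / new_div_factor,
        "div_factor": new_div_factor,
    }

# Example with a hypothetical previous run that peaked at 1e-2.
settings = follow_up_cycle(1e-2)
```

For a previous peak of 1e-2 and a divisor of 25, the short run would peak near 4e-4 and start near 2e-4, so it stays within a narrow, low range throughout.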

PPPW · December 24, 2018, 4:11pm

Hi @digitalspecialists, thanks for the reply! In my limited experiments I found that (3) works better than (1) and (2), but (3) means I need to retrain from the beginning.
For some problems I don’t have the resources to search for the best cyc_len, max_lr, divpct, and pctstart; that’s why I’m hoping I can still improve a trained model (when I plot the train/val loss, it looks like it’s still not overfitting). I observed that running (3) with a large cyc_len works better than running (1) or (2), which seems to make sense. But these are just my limited tests, and I don’t have the resources to do a systematic comparison at this point…