Resume an interrupted 1cycle policy training process

PPPW · February 1, 2019, 2:01am

Hi, I’d like to share an example of how to resume an interrupted 1cycle policy training process. In my limited experience, running one long 1cycle policy works better than running several shorter ones. However, when things go wrong, we don’t want to rerun the whole thing from the beginning, since that wastes a lot of time.

My solution is to use a callback to save your model (i.e., SaveModelCallback). If you get interrupted, you can load the model from the last completed epoch and train from there. You just need to change the learning rate schedule so that it continues from that point in the cycle instead of starting over. I have some examples here.
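
To make the schedule change concrete, here is a minimal pure-Python sketch of the idea. It is not the fastai code: one_cycle_lr and resumed_schedule are hypothetical helpers, and the cosine shape and the defaults (pct_start, div_factor, final_div) are assumptions chosen only for illustration. The point is that the learning rate depends on the global iteration, so resuming only means starting the iteration counter at the right offset.

```python
import math

def one_cycle_lr(it, total_its, lr_max, pct_start=0.3, div_factor=25.0, final_div=1e4):
    """Learning rate at 0-based global iteration `it` of a cycle of `total_its` iterations.

    Assumed shape: cosine warm-up from lr_max/div_factor to lr_max,
    then cosine annealing from lr_max down to lr_max/final_div.
    """
    warm = int(total_its * pct_start)          # iterations spent warming up
    if it < warm:
        start, end, pct = lr_max / div_factor, lr_max, it / warm
    else:
        start, end, pct = lr_max, lr_max / final_div, (it - warm) / (total_its - warm)
    # Cosine interpolation: equals `start` at pct=0 and approaches `end` as pct -> 1.
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

def resumed_schedule(start_epoch, tot_epochs, batches_per_epoch, lr_max):
    """Learning rates for the rest of the cycle, resuming at 1-based `start_epoch`."""
    first_it = (start_epoch - 1) * batches_per_epoch   # iterations already done
    total_its = tot_epochs * batches_per_epoch
    return [one_cycle_lr(it, total_its, lr_max) for it in range(first_it, total_its)]

# A run interrupted after epoch 3 of 10 picks up exactly where the full cycle left off.
full = resumed_schedule(1, 10, 100, 0.1)
resumed = resumed_schedule(4, 10, 100, 0.1)
```

Here resumed covers the same learning rates as the last 7 epochs of full, which is what we want after reloading the model saved at the end of epoch 3.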

With this technique, you can also divide the 1cycle policy into smaller parts and execute each of them. You may want to do this if you have some limitations on how many hours you can run each time. People with powerful machines may not care about this at all, but the good thing about fastai is everyone can use it to do interesting things.
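
As a rough illustration of splitting one cycle across several sessions, the sketch below plans which epochs each session would cover. plan_sessions is a hypothetical helper, not part of fastai; it only does the bookkeeping with 1-based epoch numbers, and each session would then load the last saved model and resume the schedule at its first epoch.

```python
def plan_sessions(tot_epochs, epochs_per_session):
    """Split a cycle of `tot_epochs` into (first_epoch, last_epoch) pairs, 1-based and inclusive."""
    sessions = []
    start = 1
    while start <= tot_epochs:
        end = min(start + epochs_per_session - 1, tot_epochs)  # last session may be shorter
        sessions.append((start, end))
        start = end + 1
    return sessions

# A 10-epoch cycle with at most 4 epochs per session.
sessions = plan_sessions(10, 4)
```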

If this might be useful to more users, I can submit a PR to add this feature to OneCycleScheduler (my solution inherits from it and modifies it). It only needs two optional parameters (start epoch and total epochs). The API stays the same, so people who are not using this feature won’t be affected.

Hope it helps!

sgugger · February 1, 2019, 2:35pm

You can definitely make a PR to add it to callbacks.one_cycle!

PierreO · February 1, 2019, 3:37pm

Great feature, definitely do a PR!

PPPW · February 2, 2019, 2:29am

Thanks, I have submitted a PR. Although we usually use 0-based indexing in Python, I used 1-based indexing for start_epoch because the epoch number printed out in the console is 1-based.

Also, in the PR I added an on_epoch_end method, because now the 1cycle policy can end while there are still a few epochs left. E.g., if the user sets tot_epochs to 3 but asks to train for 5 epochs:

learn.fit_one_cycle(5, 0.1, tot_epochs=3)

Then in the current PR it will stop after epoch 3. There are other ways to handle this case, e.g., raise an error to let the user know, keep the lr constant after the 1cycle ends, start another 1cycle, etc. I’m not sure which one we’d like to have, so please change this part if needed. Thanks!
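
To compare two of these options, here is a small sketch of the decision only, outside of fastai. epochs_to_run and its on_overrun argument are hypothetical names, not what the PR uses: "stop" mirrors the current behaviour of ending with the cycle, and "error" rejects the request up front.

```python
def epochs_to_run(n_epochs, tot_epochs, start_epoch=1, on_overrun="stop"):
    """How many epochs to actually train when `n_epochs` are requested.

    `start_epoch` is 1-based; the cycle covers epochs start_epoch..tot_epochs.
    """
    remaining = tot_epochs - start_epoch + 1   # epochs left in the cycle
    if n_epochs <= remaining:
        return n_epochs
    if on_overrun == "stop":                   # end training when the cycle ends
        return remaining
    if on_overrun == "error":                  # tell the user about the mismatch
        raise ValueError(
            f"asked for {n_epochs} epochs but only {remaining} remain in the cycle"
        )
    raise ValueError(f"unknown on_overrun option: {on_overrun!r}")

# The example above: 5 epochs requested, cycle of 3 epochs.
n = epochs_to_run(5, 3)
```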