Resume training with fit_one_cycle

bwangwp · September 9, 2019, 5:45pm

If training is interrupted in the middle of a long cycle, e.g. at epoch 75 of 100, how do we resume training with fit_one_cycle? I assume simply using learn.load and setting cyc_len = 100 - 75 is not enough, right? Can we recover the hyperparameters at the point of interruption, e.g. learning rate and momentum?

sgugger · September 9, 2019, 6:56pm

You can pass along start_epoch (here 75) in your call to fit_one_cycle. Coupled with loading the model you had, it should be enough to resume training.
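To illustrate why start_epoch matters, here is a rough, library-free sketch of a one-cycle schedule. It is not fastai's code: the function names are hypothetical, the defaults (pct_start=0.3, div_factor=25, moms=(0.95, 0.85)) are assumed to match fastai v1, and it works per epoch rather than per iteration. It compares the learning rate and momentum at the start of a fresh 25-epoch cycle with those at epoch 75 of a 100-epoch cycle.

```python
import math

def annealing_cos(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    cos_out = math.cos(math.pi * pct) + 1
    return end + (start - end) / 2 * cos_out

def one_cycle_at(epoch, total_epochs, max_lr=1e-3, moms=(0.95, 0.85),
                 div_factor=25.0, pct_start=0.3, final_div=None):
    """Return (lr, momentum) at the start of `epoch` in a one-cycle schedule.

    All defaults are assumptions, not values read from the library.
    """
    if final_div is None:
        final_div = div_factor * 1e4
    pct = epoch / total_epochs
    low_lr, final_lr = max_lr / div_factor, max_lr / final_div
    if pct < pct_start:
        # Warm-up phase: lr rises to max_lr while momentum falls.
        p = pct / pct_start
        return annealing_cos(low_lr, max_lr, p), annealing_cos(moms[0], moms[1], p)
    # Annealing phase: lr decays towards final_lr while momentum rises back.
    p = (pct - pct_start) / (1 - pct_start)
    return annealing_cos(max_lr, final_lr, p), annealing_cos(moms[1], moms[0], p)

# A fresh 25-epoch cycle restarts the warm-up from the low learning rate...
fresh_lr, fresh_mom = one_cycle_at(0, 25)
# ...whereas epoch 75 of 100 is already well into the annealing phase.
resumed_lr, resumed_mom = one_cycle_at(75, 100)
print(fresh_lr, fresh_mom)
print(resumed_lr, resumed_mom)
```

Under these assumptions the two starting points differ in both learning rate and momentum, which is why shortening cyc_len alone does not reproduce the interrupted run, while keeping the full cycle length and passing start_epoch picks up the schedule where it stopped.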

bwangwp · September 9, 2019, 7:25pm

Thanks! Your answer helps a lot!

Alexandre_DIEUL · January 31, 2020, 7:13am

Hi,

Sorry for the noob question but, in my case, a preemption of my GCP VM interrupted my training. The VM was simply shut down.
If I resume training with fit_one_cycle, passing along start_epoch with the right epoch number, I don’t think it would work, because the only model I could load would be the one saved before launching fit_one_cycle. Or am I missing something?

If I’m not mistaken, how can I save the model automatically at the end of each epoch so that I can resume training after a shutdown/preemption?

Thanks a lot for any answer on that matter and thanks for the incredible work you folks are doing here.

Regards,
Alexandre.
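For readers with the same problem, the general pattern is to write a checkpoint at the end of every epoch and, on restart, continue from the epoch after the last one saved. The sketch below is library-agnostic and purely illustrative: save_checkpoint, load_checkpoint and train are hypothetical helpers, and the `state` dict stands in for real model and optimizer weights. In fastai v1 itself, I believe SaveModelCallback with every='epoch' covers the saving part, but check the docs for your version.

```python
import json
import os
import tempfile

def save_checkpoint(path, epoch, state):
    """Write the state reached at the end of `epoch`."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    # Rename in one step so a crash mid-write does not clobber the last good file.
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return (next_epoch, state), or (0, None) when no checkpoint exists."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["epoch"] + 1, ckpt["state"]

def train(path, total_epochs, stop_after=None):
    """Toy training loop that resumes from the checkpoint at `path` if present."""
    start_epoch, state = load_checkpoint(path)
    state = state or {"steps": 0}
    for epoch in range(start_epoch, total_epochs):
        state["steps"] += 1              # placeholder for one epoch of training
        save_checkpoint(path, epoch, state)
        if stop_after is not None and epoch == stop_after:
            return epoch                 # simulate a preemption after this epoch
    return total_epochs - 1

ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
interrupted_at = train(ckpt_path, total_epochs=5, stop_after=2)
resume_from, _ = load_checkpoint(ckpt_path)   # epoch to restart from
train(ckpt_path, total_epochs=5)              # second run picks up where it stopped
_, final_state = load_checkpoint(ckpt_path)
print(interrupted_at, resume_from, final_state)
```

Here resume_from plays the role of the value you would pass as start_epoch after loading the last saved model.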

Alexandre_DIEUL · January 31, 2020, 7:22am

I think I found the solution, here

etremblay · September 14, 2020, 12:10am

Is there an equivalent for fastai v2?

Checking the code, it doesn’t seem like it. My computer just crashed after 24h+ of training :(.