If training is interrupted in the middle of a long cycle, e.g. at epoch 75 of 100, how do we resume training with fit_one_cycle? I assume simply using learn.load and setting cyc_len = 100 - 75 is not enough, right? Can we recover the hyperparameters at the point of interruption, e.g. the learning rate and momentum?
You can pass start_epoch (here 75) in your call to fit_one_cycle. Coupled with loading the model you saved, it should be enough to resume training.
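For intuition on why start_epoch is enough: a one-cycle schedule is a pure function of overall training progress, so re-evaluating it at epoch 75 of 100 reproduces the learning rate in effect at the interruption. Here is a rough, framework-free sketch of the schedule shape (the function and parameter names are illustrative assumptions, not fastai's exact API):

```python
import math

def one_cycle_lr(pct, lr_max=1e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Cosine one-cycle learning rate at training progress pct in [0, 1].

    Illustrative re-implementation of the general one-cycle shape:
    warm up from lr_max/div to lr_max, then anneal to lr_max/div_final.
    """
    def cos_anneal(start, end, frac):
        # Cosine interpolation from start (frac=0) to end (frac=1).
        return end + (start - end) / 2 * (math.cos(math.pi * frac) + 1)

    if pct < pct_start:  # warm-up phase
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # annealing phase
    return cos_anneal(lr_max, lr_max / div_final,
                      (pct - pct_start) / (1 - pct_start))

# Because the schedule depends only on overall progress, resuming with
# start_epoch=75 out of 100 re-evaluates it at pct = 75/100 and recovers
# the same learning rate the interrupted run was using.
lr_at_resume = one_cycle_lr(75 / 100)
```

The same reasoning applies to the momentum schedule, which follows the mirror-image curve.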
Thanks! Your answer helps a lot!
Sorry for the noob question, but in my case a preemption of my GCP VM interrupted my training: the VM was simply shut down.
If I resume training with fit_one_cycle, passing start_epoch with the right epoch number, I don't think it would work: the model I would load was saved before launching fit_one_cycle. Or am I missing something?
How can I save the model automatically at the end of each epoch, so I can resume training in case of a shutdown/preemption?
Thanks a lot for any answer on that matter and thanks for the incredible work you folks are doing here.
I think I found the solution here.
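For reference, the pattern behind a per-epoch save callback (fastai v1 ships one as SaveModelCallback with every='epoch') is simply: write a checkpoint at every epoch boundary, and on restart resume from the last completed epoch. A minimal framework-free sketch of that pattern (all names below are illustrative, not fastai's API):

```python
import json
import os

def save_checkpoint(path, epoch, state):
    """Atomically write a per-epoch checkpoint, so a crash mid-write
    cannot corrupt the previous checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def train(n_epochs, path="ckpt.json"):
    """Run (or resume) a toy training loop with per-epoch checkpoints."""
    start, state = 0, {"loss": None}
    if os.path.exists(path):
        # Resume from the epoch after the last completed one.
        ckpt = json.load(open(path))
        start, state = ckpt["epoch"] + 1, ckpt["state"]
    for epoch in range(start, n_epochs):
        state["loss"] = 1.0 / (epoch + 1)  # stand-in for real training work
        save_checkpoint(path, epoch, state)
    return start, state
```

With a real model you would save the weights and optimizer state instead of a JSON dict, but the resume logic is the same: a preemption only ever costs you the current, incomplete epoch.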
Is there the equivalent for fastai v2?
Checking the code, it doesn't seem like it. My computer just crashed after 24h+ of training :(