If training is interrupted in the middle of a long cycle, e.g. at epoch 75 of 100, how do we resume training with fit_one_cycle? I assume simply using learn.load and setting cyc_len = 100 - 75 is not enough, right? Can we recover the hyperparameters at the point of interruption, e.g. the learning rate and momentum?
You can pass start_epoch (here 75) in your call to fit_one_cycle. Coupled with loading the model you saved, it should be enough to resume training.
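For intuition on why start_epoch is enough: a one-cycle schedule is a pure function of overall training progress, so re-evaluating it at epoch 75 of 100 reproduces the learning rate in effect at the interruption. Here is a rough, framework-free sketch of the schedule shape (the function and parameter names are illustrative assumptions, not fastai's exact API):

```python
import math

def one_cycle_lr(pct, lr_max=1e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Cosine one-cycle learning rate at training progress pct in [0, 1].

    Illustrative re-implementation of the general one-cycle shape:
    warm up from lr_max/div to lr_max, then anneal to lr_max/div_final.
    """
    def cos_anneal(start, end, frac):
        # Cosine interpolation from start (frac=0) to end (frac=1).
        return end + (start - end) / 2 * (math.cos(math.pi * frac) + 1)

    if pct < pct_start:  # warm-up phase
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # annealing phase
    return cos_anneal(lr_max, lr_max / div_final,
                      (pct - pct_start) / (1 - pct_start))

# Because the schedule depends only on overall progress, resuming with
# start_epoch=75 out of 100 re-evaluates it at pct = 75/100 and recovers
# the same learning rate the interrupted run was using.
lr_at_resume = one_cycle_lr(75 / 100)
```

The same reasoning applies to the momentum schedule, which follows the mirror-image curve.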
Thanks! Your answer helps a lot!
Sorry for the noob question, but in my case a preemption of my GCP VM interrupted my training: the VM was simply shut down.
If I resume training with fit_one_cycle, passing start_epoch with the right epoch number, I don't think it would work: the model I would load was saved before launching fit_one_cycle. Or am I missing something?
How can I save the model automatically at the end of each epoch, so I can resume training in case of a shutdown/preemption?
Thanks a lot for any answer on that matter and thanks for the incredible work you folks are doing here.
I think I found the solution here.
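For reference, the pattern behind a per-epoch save callback (fastai v1 ships one as SaveModelCallback with every='epoch') is simply: write a checkpoint at every epoch boundary, and on restart resume from the last completed epoch. A minimal framework-free sketch of that pattern (all names below are illustrative, not fastai's API):

```python
import json
import os

def save_checkpoint(path, epoch, state):
    """Atomically write a per-epoch checkpoint, so a crash mid-write
    cannot corrupt the previous checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def train(n_epochs, path="ckpt.json"):
    """Run (or resume) a toy training loop with per-epoch checkpoints."""
    start, state = 0, {"loss": None}
    if os.path.exists(path):
        # Resume from the epoch after the last completed one.
        ckpt = json.load(open(path))
        start, state = ckpt["epoch"] + 1, ckpt["state"]
    for epoch in range(start, n_epochs):
        state["loss"] = 1.0 / (epoch + 1)  # stand-in for real training work
        save_checkpoint(path, epoch, state)
    return start, state
```

With a real model you would save the weights and optimizer state instead of a JSON dict, but the resume logic is the same: a preemption only ever costs you the current, incomplete epoch.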
Is there the equivalent for fastai v2?
Checking the code, it doesn't seem like it. My computer just crashed after 24h+ of training :(