Resuming training with correct LR (in One Cycle schedule) after instance shutdown

zache · September 28, 2019, 3:46pm

I’m training a fairly large model on spot / preemptible instances to save money, but the server can shut down in the middle of a cycle.

Is there a Callback or something similar which saves the LR schedule at each epoch so I can resume training at the exact point in the case of a shutdown?

sgugger · September 28, 2019, 7:51pm

You can resume training at a given epoch with start_epoch=...
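To make concrete what resuming at a given epoch means for the learning rate, here is a minimal standalone sketch of a two-phase cosine one-cycle schedule. It is not fastai's own code: the function names (`annealing_cos`, `one_cycle_lr`, `resume_iteration`) and the default values for `div_factor`, `pct_start` and `final_div` are assumptions chosen for illustration. The idea is that a run resumed at epoch k should see the same LR as an uninterrupted run would at iteration k * (iterations per epoch).

```python
import math

def annealing_cos(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lr(it, total_iters, max_lr, div_factor=25.0, pct_start=0.3, final_div=None):
    """LR at 0-based iteration `it` of a two-phase one-cycle schedule.

    Defaults are illustrative assumptions, not values read from any library.
    """
    if final_div is None:
        final_div = div_factor * 1e4
    warm = int(total_iters * pct_start)  # iterations spent ramping up
    if it < warm:
        # Phase 1: ramp up from max_lr / div_factor to max_lr
        return annealing_cos(max_lr / div_factor, max_lr, it / warm)
    # Phase 2: anneal from max_lr down towards max_lr / final_div
    return annealing_cos(max_lr, max_lr / final_div, (it - warm) / (total_iters - warm))

def resume_iteration(start_epoch, iters_per_epoch):
    """Global iteration index at which a run resumed at `start_epoch` picks up."""
    return start_epoch * iters_per_epoch

# Example: a 10-epoch cycle with 100 iterations per epoch, interrupted after epoch 4
iters_per_epoch = 100
total_iters = 10 * iters_per_epoch
it = resume_iteration(4, iters_per_epoch)
print(f"resume at iteration {it}, lr = {one_cycle_lr(it, total_iters, 1e-3):.6f}")
```

This only covers the position in the schedule; the model weights still have to be restored from a saved checkpoint before training continues.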

zache · September 28, 2019, 10:11pm

Ah thank you so much! That’s exactly what I’m looking for

sgebrial · September 29, 2019, 1:53am

How do preemptible instances work? Do they restart on their own when capacity is available, or do you have to do that manually? Do they restart your script automatically?

zache · September 29, 2019, 7:58am

By default you have to do it manually. It’s kind of annoying, but if you have more time than money then it can be nice. I’m pretty sure you could create a startup script to manage this, or run on a Kubernetes cluster to avoid shutdown altogether, but I haven’t figured out how to do that yet (if anyone has, please mention it!).

zache · October 1, 2019, 11:28am

Quick follow-up: is there also a way to see what the LR is at each epoch?

sgugger · October 1, 2019, 11:50am

You can see the LRs per iteration, stored in learn.recorder.lrs. Check len(learn.data.train_dl) to know how many iterations there are in an epoch.
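Going from per-iteration values to per-epoch values is just a matter of slicing the list in chunks of one epoch. A minimal sketch, assuming the LRs are available as a flat list with one entry per iteration; the helper name `lr_per_epoch` is hypothetical, not part of fastai:

```python
def lr_per_epoch(lrs, iters_per_epoch):
    """Return a (first_lr, last_lr) pair for each epoch.

    `lrs` is a flat list with one LR per iteration. A trailing partial
    epoch (e.g. from an interrupted run) is reported as well.
    """
    if iters_per_epoch <= 0:
        raise ValueError("iters_per_epoch must be positive")
    out = []
    for start in range(0, len(lrs), iters_per_epoch):
        chunk = lrs[start:start + iters_per_epoch]
        out.append((chunk[0], chunk[-1]))
    return out

# Toy data: 2 iterations per epoch, with the last epoch cut short
for epoch, (first, last) in enumerate(lr_per_epoch([0.1, 0.2, 0.3, 0.4, 0.5], 2)):
    print(epoch, first, last)
```

With the names from the reply above, the call would presumably look like lr_per_epoch(learn.recorder.lrs, len(learn.data.train_dl)).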

zache · October 1, 2019, 12:24pm

Perfect. Thanks again for the quick replies, very much appreciated!