I’m training a fairly large model on spot / preemptible instances to save money but I have the problem that the server shuts down in the middle of a cycle.
Is there a Callback or something similar which saves the LR schedule at each epoch so I can resume training at the exact point in the case of a shutdown?
How does preemptible instances work? Does it restart on it’s own when it’s available, or do you have to manually do that? Does it restart your script automatically?
By default you have to do it manually. It’s kind of annoying but if you have more time than money then it can be nice. I’m pretty sure you could create a startup script to manage this or run on a kubernetes cluster to avoid shutdown altogether but I haven’t figured out how to do that yet (if anyone has, please mention it!).