I’m training a fairly large model on spot/preemptible instances to save money, but the server sometimes shuts down in the middle of a training cycle.
Is there a Callback or something similar that saves the LR schedule state at each epoch, so I can resume training from the exact point after a shutdown?
You can start training again at a given epoch with the start_epoch argument of fit_one_cycle.
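The reason this resumes at the exact point is that the one-cycle schedule is a deterministic function of the global iteration, so recomputing it from a saved epoch gives identical LR values. Here is a minimal stdlib sketch of that idea (not fastai's actual implementation; the parameter names and the dummy numbers are illustrative):

```python
import math

# Sketch: a one-cycle-style cosine LR schedule written as a pure function
# of the global iteration. Because it depends only on `it`, resuming from
# a saved epoch reproduces exactly the LRs an uninterrupted run would use.
def one_cycle_lr(it, total_iters, max_lr=0.01, pct_start=0.3, div=25.0):
    """LR at global iteration `it` (illustrative parameters, not fastai's)."""
    warm = int(total_iters * pct_start)
    if it < warm:  # warm-up: rise from max_lr/div to max_lr
        pct = it / max(warm, 1)
        lo, hi = max_lr / div, max_lr
    else:          # anneal: fall from max_lr toward ~0
        pct = (it - warm) / max(total_iters - warm, 1)
        lo, hi = max_lr, max_lr / (div * 1e4)
    return lo + (hi - lo) * (1 - math.cos(math.pi * pct)) / 2

iters_per_epoch = 100
total_iters = 10 * iters_per_epoch
start_epoch = 6  # epoch checkpointed before the instance was preempted
resume_it = start_epoch * iters_per_epoch
print(one_cycle_lr(resume_it, total_iters))  # same LR an uninterrupted run sees here
```

With a real Learner you would instead pass the saved epoch to fit_one_cycle via start_epoch and let the library recompute the schedule position for you.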
Ah thank you so much! That’s exactly what I’m looking for
How do preemptible instances work? Do they restart on their own when capacity is available again, or do you have to do that manually? Does your script restart automatically?
By default you have to do it manually. It’s kind of annoying, but if you have more time than money it can be nice. I’m pretty sure you could write a startup script to handle the restart, or run on a Kubernetes cluster to avoid the shutdown altogether, but I haven’t figured out how to do that yet (if anyone has, please mention it!).
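One way a startup script could work: most cloud providers let you attach a script that runs on every boot (e.g. a GCE startup script on a persistent disk), so on restart it can find the newest checkpoint and resume from the right epoch. A hedged sketch, where train.py, its --start-epoch/--load flags, and the epoch_N.pth naming are all hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical boot-time resume logic: print the command that should be
# run, based on the newest checkpoint in a (persistent) checkpoint dir.
resume_cmd() {
    local dir="$1"
    local latest
    # newest checkpoint first; suppress the error when none exist yet
    latest="$(ls -1t "$dir"/epoch_*.pth 2>/dev/null | head -n 1)"
    if [ -n "$latest" ]; then
        local epoch
        # epoch number is encoded in the filename, e.g. epoch_7.pth
        epoch="$(basename "$latest" | sed -E 's/^epoch_([0-9]+)\.pth$/\1/')"
        echo "python train.py --start-epoch $epoch --load $latest"
    else
        echo "python train.py"
    fi
}
```

The script only decides what to run; at boot you would exec the printed command (and the checkpoint dir must live on a disk that survives preemption).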
Quick follow-up – is there also a way to see what the LR is at each epoch?
You can see the LRs per iteration, stored in learn.recorder.lrs, and use len(learn.data.train_dl) to know how many iterations there are in an epoch.
Perfect. Thanks again for the quick replies, very much appreciated!