Resuming training with the correct LR (in a One Cycle schedule) after instance shutdown

I’m training a fairly large model on spot / preemptible instances to save money, but I run into the problem that the instance sometimes shuts down in the middle of a cycle.

Is there a Callback or something similar that saves the LR schedule at each epoch, so that I can resume training from the exact same point after a shutdown?

You can resume training at a given epoch with start_epoch=...
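Concretely, something like this. This is a minimal sketch assuming the fastai v1 API (fit_one_cycle’s start_epoch argument, plus SaveModelCallback with every='epoch' for checkpointing); data, the architecture, and the checkpoint name 'cycle' are placeholders:

```python
from fastai.vision import *
from fastai.callbacks import SaveModelCallback

# `data` is assumed to be an existing DataBunch (placeholder).
learn = cnn_learner(data, models.resnet34, metrics=accuracy)

# Checkpoint at the end of every epoch, so a spot shutdown costs at
# most one epoch of work. Files are written as cycle_0, cycle_1, ...
learn.fit_one_cycle(10, max_lr=1e-3,
                    callbacks=[SaveModelCallback(learn, every='epoch', name='cycle')])

# --- after a shutdown, on a fresh instance ---
learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.load('cycle_3')  # e.g. the checkpoint written after epoch 3
# start_epoch fast-forwards the one-cycle schedule, so the LR picks up
# where it left off instead of restarting the cycle. Double-check that
# the epoch index lines up with your checkpoint naming.
learn.fit_one_cycle(10, max_lr=1e-3, start_epoch=4)
```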


Ah, thank you so much! That’s exactly what I was looking for.

How do preemptible instances work? Do they restart on their own when capacity is available, or do you have to do that manually? Do they restart your script automatically?

By default you have to do it manually. It’s kind of annoying, but if you have more time than money it can be worthwhile. I’m pretty sure you could write a startup script to manage this, or run on a Kubernetes cluster to avoid shutdowns altogether, but I haven’t figured out how to do that yet (if anyone has, please mention it!).

Quick follow-up: is there also a way to see what the LR is at each epoch?

You can see the LRs per iteration, stored in learn.recorder.lrs. Check len(learn.data.train_dl) to know how many iterations there are in an epoch.
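For example (again assuming the fastai v1 API; the index arithmetic below is just a rough sketch):

```python
# learn.recorder.lrs holds one LR value per training iteration.
iters_per_epoch = len(learn.data.train_dl)

# LR at the start of each epoch seen so far.
for epoch, lr in enumerate(learn.recorder.lrs[::iters_per_epoch]):
    print(f'epoch {epoch}: lr = {lr:.2e}')

# Or plot the whole schedule:
learn.recorder.plot_lr()
```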


Perfect. Thanks again for the quick replies, very much appreciated!