Has anyone looked at using PyTorch checkpoints during training (e.g. saving model.state_dict() periodically) to allow training on interruptible spot instances, like the new Gradient° Low-Cost instances on Paperspace or AWS spot instances directly?
https://pytorch.org/tutorials/beginner/saving_loading_models.html
Regularly saving the model parameters and optimizer state during training seems like it could be a much more economical way to train larger models. Would it be possible to have direct support for this feature in the fast.ai library?
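For reference, here is a minimal sketch of what spot-instance-friendly checkpointing could look like in plain PyTorch, following the tutorial linked above. The model, optimizer, and `CKPT_PATH` are illustrative placeholders, not part of any fast.ai API: the idea is just to save the model and optimizer state dicts plus the epoch counter after each epoch, and to resume from the latest checkpoint if one exists when the instance restarts.

```python
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"  # hypothetical path; in practice, use durable storage

# Toy model/optimizer just to illustrate the pattern
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def save_checkpoint(epoch):
    # Save everything needed to resume: model weights, optimizer state, progress
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, CKPT_PATH)

def load_checkpoint():
    # If the spot instance was interrupted, pick up from the last saved epoch
    if not os.path.exists(CKPT_PATH):
        return 0  # no checkpoint yet: start from scratch
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["epoch"] + 1  # next epoch to run

start_epoch = load_checkpoint()
for epoch in range(start_epoch, 3):
    # ... one epoch of training would go here ...
    save_checkpoint(epoch)
```

One caveat: the checkpoint must live on storage that survives the instance (a persistent volume or object store), since the spot VM's local disk is lost on interruption.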