Hello everyone,
I am running my p2 instance, and my models are taking increasingly long to train. Currently, my Kaggle competition model takes ~7 hours to train. I connect to AWS via SSH from my laptop, and I keep changing locations, so my SSH session is closed every time I disconnect from the internet. The Jupyter notebook then stops and all the work is lost.
So currently I need to sit in one place for 7 hours if I want my model to be trained. Alternatively, I can train the model for 3 hours, save the weights, change workplace, load the weights, train for the next 4 hours, save the weights, and so on. I imagine that when I want to train a model that takes 20 hours, this will become really painful logistically.
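For reference, the save-and-resume workaround looks roughly like the sketch below. It is a toy stand-in, not my actual notebook: the `train` function, the `stop_after` parameter, and the JSON checkpoint file are made up for illustration, and a real model would save its weights with the framework's own save and load calls instead.

```python
import json
import os
import tempfile


def train(total_epochs, checkpoint_path, stop_after=None):
    """Toy training loop that resumes from checkpoint_path if it exists.

    All names here are hypothetical; the "weight" update stands in for
    one real training epoch.
    """
    # Resume from the last saved state, or start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"epoch": 0, "weight": 0.0}

    epochs_run = 0
    while state["epoch"] < total_epochs:
        # Simulate having to stop early (e.g. before changing location).
        if stop_after is not None and epochs_run >= stop_after:
            break

        # Stand-in for one epoch: move the weight halfway toward 1.0.
        state["weight"] += 0.5 * (1.0 - state["weight"])
        state["epoch"] += 1
        epochs_run += 1

        # Save after every epoch. Writing to a temp file and renaming it
        # avoids a half-written checkpoint if the process dies mid-save.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    first = train(7, path, stop_after=3)   # first sitting: 3 epochs
    resumed = train(7, path)               # second sitting: the remaining 4
    print(first["epoch"], resumed["epoch"])  # prints: 3 7
```

This works, but every interruption still needs me to be there to restart the run by hand.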
I had in mind the following options:
- A desktop PC with a stable connection that I connect to remotely from my laptop (but this would require buying a PC)
- A program resembling Screen for Linux, but I couldn’t find one for Windows
- Testing higher-end AWS instances to decrease training time (but this would cost more)
There must be a way around this. I am really curious to hear how you work with long-to-train models and get around this problem.
Thanks so much,
Pawel