Ways of working with long-to-train models

PawelGodula · August 7, 2017, 7:04am

Hello everyone,

I am running a p2 instance, and my models are taking increasingly long to train. Currently, my Kaggle competition model takes ~7 hours to compute. I connect to AWS via SSH from my laptop, and I keep changing locations, so my SSH session is closed every time I disconnect from the internet. The Jupyter notebook then stops and all the work is lost.

So currently I need to sit in one place for 7 hours if I want to train my model. Alternatively, I can train the model for 3 hours, save the weights, change workplace, load the weights, run the model for the next 4 hours, save the weights, etc. I imagine that when I want to train a model that takes 20 hours, this becomes really painful logistically.
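The save-weights / load-weights routine described above can be automated so that a run picks up wherever it stopped. Below is a minimal sketch using only the Python standard library. `train_resumable`, the JSON checkpoint format, and the placeholder "weights" are hypothetical stand-ins for illustration; with a real framework, the model's own save and load calls would take the place of the JSON file.

```python
import json
import os
import tempfile


def train_resumable(total_epochs, checkpoint_path, stop_after=None):
    """Run a stand-in training loop that resumes from a checkpoint file.

    stop_after is a hypothetical parameter that simulates an interruption
    after that many epochs in the current session.
    """
    # Resume from the last saved state if a checkpoint already exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"epoch": 0, "weights": [0.0]}

    epochs_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and epochs_run >= stop_after:
            break  # simulates the session being cut off

        # Placeholder for one real epoch of training.
        state["weights"] = [w + 1.0 for w in state["weights"]]
        state["epoch"] += 1
        epochs_run += 1

        # Write to a temporary file and rename it, so that an interruption
        # in the middle of a write cannot corrupt the last good checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    print(train_resumable(7, path, stop_after=3))  # first session, interrupted
    print(train_resumable(7, path))                # second session, resumes
```

Saving after every epoch means an interruption costs at most one epoch of work, at the price of some extra disk writes.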

I had in mind the following options:

A desktop PC with a stable connection, which I would connect to remotely from my laptop (but this would require buying a PC)
A program resembling Screen for Linux, but I couldn’t find one for Windows
Testing higher-end instances on AWS to decrease training time (but this would cost more)

There must be a way around this. I am really curious to hear how you work with long-to-train models and get around this problem.

Thanks so much,
Pawel

msp · August 7, 2017, 7:44am

It seems that Jupyter Notebook still doesn’t support this well: see this issue on GitHub. The developer suggests explicitly saving the output of those long-running computations from within your code. I haven’t tried these, but Stack Overflow has some suggestions on how to do that, for example using the %%capture feature.
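One way to save the output explicitly is to write progress to a log file rather than relying on the notebook's cell output. The sketch below uses Python's standard `logging` module, not %%capture, and is only an illustration: `make_file_logger`, `train`, the log file name, and the placeholder loss are all hypothetical.

```python
import logging
import os
import tempfile


def make_file_logger(log_path, name="training"):
    """Return a logger that appends timestamped messages to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the root logger
    if not logger.handlers:   # avoid duplicate handlers when a cell is re-run
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


def train(logger, epochs):
    """Stand-in training loop that records one log line per epoch."""
    for epoch in range(1, epochs + 1):
        loss = 1.0 / epoch  # placeholder for a real loss value
        logger.info("epoch %d loss %.4f", epoch, loss)


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "training.log")
    train(make_file_logger(log_path), epochs=3)
    with open(log_path) as f:
        print(f.read())
```

Because the messages go to a file on the instance, they can be read after reconnecting (for example by opening the file in a new cell), whether or not the browser was attached while the cell was running.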

eduedix · August 7, 2017, 7:51am

Do you start Jupyter Notebook in a tmux session?

PawelGodula · August 7, 2017, 8:40am

Seldom. I only do it when I expect to be copying some data to the instance.
Would tmux help?

msp · August 7, 2017, 9:01am

I doubt that it would help with the problem you’re describing; your problem is not that the Jupyter notebook gets terminated, but that when you resume work on a notebook that has been running, the output is not what you would like.