How to train large models (that take a lot of time)?

Currently I’m running Jupyter notebooks on AWS (because I have credits left over there).

The problem is that if the model is large and takes a long time to train, then as soon as the SSH connection between my machine and EC2 drops, the browser session effectively terminates, the notebook breaks, and I can’t see the results of training (and have to start all over).

So I’m looking for a solution that would let me train models for long periods of time on AWS, even when my local machine is disconnected.

Some thoughts I had:

  • I tried using VNC to connect to the machine and run a notebook locally, but apparently the deep learning AMIs ship “headless”, so there is no desktop environment…
  • I thought about taking the code from the notebook and preparing a training script… but I have no idea how to do that. It also doesn’t seem very effective from an experimentation viewpoint.
  • I tried looking for a way to run notebooks without the browser, but there doesn’t seem to be one?

This is my first time working with notebooks / ML, so please treat me like I’m 5. Any help super appreciated!

Thanks!


Create a Python script (e.g., train.py) and run it from the shell this way:

nohup python train.py &

Output will be redirected to nohup.out, so from time to time you can check its contents, or monitor what is happening with

tail -f nohup.out
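If you prefer an explicit log file over the default nohup.out, a variant of the above looks like this (the train.py created here is a trivial stand-in for a real training script, just so the example runs end to end):

```shell
# stand-in for your real training script, only so this example is self-contained
printf 'print("epoch 1 done")\n' > train.py

# run immune to hangups, with an explicit log file instead of nohup.out
nohup python3 train.py > train.log 2>&1 &
wait                 # in real use you would simply log out here

cat train.log        # or: tail -f train.log while training is in progress
```

Because the process is detached from the terminal, you can close the SSH session (and your laptop) and the training keeps going on the EC2 instance.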


Hi dats-vs-cogs, hope you are well!
I had a similar problem which I tried to solve. I am open to further suggestions for improvement.

See my post Platform: Colab ✅ Problem - Colab session timing out after 12 hours model requires 20 hours what is the solution?

Maybe you could combine this with other suggestions and make something better.

Hope this helps

mrfabulous1 :smiley::smiley:

@mrfabulous1 one thing you can do is log how many epochs you got through in fit_one_cycle, save the model at that point, and then continue training from there :slight_smile: (not sure if this was suggested already!)
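The save-and-resume idea can be sketched framework-agnostically. In fastai you would call learn.save(...) after a block of epochs and learn.load(...) before resuming; the plain-Python sketch below (all names hypothetical, with a string standing in for real weights) just shows the resume logic:

```python
import os
import pickle

CHECKPOINT = "checkpoint.pkl"  # hypothetical checkpoint path

def train(total_epochs):
    """Train, resuming from the last completed epoch if a checkpoint exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            state = pickle.load(f)          # resume: {"epoch": n, "weights": ...}
    else:
        state = {"epoch": 0, "weights": None}

    for epoch in range(state["epoch"], total_epochs):
        # stand-in for one real epoch of training
        state["weights"] = f"weights-after-epoch-{epoch}"
        state["epoch"] = epoch + 1
        # checkpoint after every epoch, so a dropped session loses at most one
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)
    return state

print(train(total_epochs=3)["epoch"])  # → 3
```

If the process dies mid-run, re-running the same call picks up from the last saved epoch instead of starting over.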


You can start the notebook in a screen session.

SSH into the machine, then:

screen -S jupyter

then you can run Jupyter in no-browser mode with

jupyter notebook --no-browser

One hack I did (which is not safe) is to enable HTTP access to my VM, so that I can access Jupyter without needing SSH tunneling.
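For anyone wondering what the SSH tunneling being avoided here looks like: a minimal sketch, where the key path, user, hostname, and port are all placeholders to replace with your own:

```shell
# forward local port 8888 to the notebook server running on the EC2 instance;
# key path, user, and hostname below are placeholders
ssh -i ~/.ssh/my-key.pem -L 8888:localhost:8888 ubuntu@ec2-1-2-3-4.compute-1.amazonaws.com
# then open http://localhost:8888 in your local browser
```

With the tunnel, the notebook server never has to be exposed to the public internet, which is the safety trade-off being made above.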

Hey @m_ke - thanks for the reply. I’m already using --no-browser, but that only means Jupyter doesn’t try to open a browser when you launch it. It doesn’t change the fact that if you’re training something from a browser later and close the browser - it breaks.

Unless I’m missing something.

Thanks @VDM. Any good tutorials on how to convert a notebook into train.py?

Could you explain to a 5-year-old what HTTP access to a remote server allows you to do that I can’t do through SSH? Like, what are you winning here? I don’t come from an engineering background; I’m learning on the fly.

So I’m using tmux - but that is not enough.

When the connection breaks, 2 things happen:

  • The terminal session ends (tmux helps preserve this)
  • The browser session ends (tmux is of no help here)

It’s the second part that I need a solution for.


As long as you do not use Colab-specific libraries (e.g., to access Google Drive), File > Download .py from the menu will give you a Python script; from a shell, jupyter nbconvert --to script does the same for any notebook file.
Also, shell commands (lines starting with !) will not run from the script directly, but you can always run them from the shell without the !.

Just a fancy interface: step-by-step execution during development/debugging, easy documentation of what you do, things like that.

Timeout won’t happen as long as you don’t put your computer to sleep. It also allows me to access it from anywhere, even on mobile.