Locally-based workflow with version control and heavy-lifting remotes

Outlined here is my current ideal workflow for this course and for machine learning projects in general (small caveat: I’m new to ML but not to software). I like it because it lets developers use their finely-tuned local text editors and keeps git/GitHub for version control, which for non-trivial projects feels like a necessity to me, without losing the benefits of Jupyter. Here it is:

  • Pick a favorite text editor with support for Jupyter notebooks. Right now, choices appear to be Atom with Hydrogen and Visual Studio Code with the Jupyter plugin. Vim keybindings should be available for both as well for those inclined.
  • Establish a folder and repo for your project using either git init or git clone from a GitHub repo.
  • Work on regular Python (.py) files using your aforementioned favorite text editor, committing and pushing changes as you make them with git.
  • Whenever you need to know the result of a computationally-expensive block of Python code, leverage a remote server with a GPU thanks to the Hydrogen/Jupyter plugins (see the sketch just after this list).**
  • To communicate your results clearly, write them up in a top-level Jupyter notebook (.ipynb) file. I think this format is well-suited to top-level communications because of its ability to combine figures, code, and nicely-formatted written explanations. But, as described above, I’d rather work with basic Python files with version control for development/iteration, particularly for the details of a project.
  • Finally, because the project lives in git/GitHub, it’d be fairly straightforward to git pull the repo onto a remote server and write the top-level notebook there if any code snippets it needs to execute for demonstrations or results would benefit from a GPU (currently, I don’t think it is possible to run snippets of a local .ipynb file remotely the way it is for local .py files).
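
To make this concrete, here is a minimal sketch of what one of those plain .py files might look like. As far as I can tell, both Hydrogen and the VS Code Jupyter plugin treat # %% markers as cell boundaries, so you can send just the expensive block to the remote kernel from your editor. The file name and the computation are placeholders, not part of any real project:

```python
# train_sketch.py -- an ordinary Python file, version-controlled with git

# %% cheap setup: runs fine on any kernel, local or remote
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100))
w_true = rng.normal(size=100)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

# %% expensive block: send just this cell to the remote GPU kernel
# (a stand-in computation here; imagine model training instead)
XtX = X.T @ X
w_hat = np.linalg.solve(XtX, X.T @ y)

# %% quick check of the result
print("max abs error:", np.abs(w_hat - w_true).max())
```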

What do you guys think? :slight_smile:


**This part is the key. Although I initially had trouble connecting to Crestle this way when I first posted a few hours ago, I have since verified it works: using ngrok to tunnel to a friend’s localhost, I was able to run snippets of my code on their Jupyter instance from my own editor. The same should be possible to set up on a generic remote server with better hardware, provided sufficient privileges.
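
For anyone wiring this up, a quick sanity check is to run a tiny block like the one below through the connected kernel and confirm it really executed on the remote machine and can see the GPU. The PyTorch check is only one option and assumes PyTorch is installed on the remote; any framework’s device query would do.

```python
# run this block through the connected kernel to confirm where it executes
import socket

print("kernel host:", socket.gethostname())

# optional GPU check, assuming PyTorch is installed on the remote
try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed on this kernel")
```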


This workflow looks practical. Nice idea. I am also on the lookout for a better way to streamline my workflow.

Whenever you need to know the result of a computationally-expensive block of Python code, leverage a remote server with a GPU thanks to the Hydrogen/Jupyter plugins.**

In my case, my desktop runs Ubuntu with VS Code as the editor and the Jupyter plugin connected to an AWS server or a local server. AWS works well for me without the need for a secure tunnel.

Here’s an idea to think about. I am trying to improve my workflow to reduce server costs by automating the starting and stopping of the remote server whenever I need it. I think this can be done using GitHub hooks that get triggered during a git push. I am planning to give Docker containers a try for this PoC. For now, there are two challenges I can think of with this approach:

  1. Latency - startup time
  2. Portability - large datasets need to be moved to a different container (server) when the container starts.

Glad to hear you like my approach. :slight_smile:

Ultimately, I’m planning to connect to my own remote to save money (after repurposing my gaming desktop with an Ubuntu install), but I’ve set everything up on Google Cloud Platform in the meantime. It should be pretty equivalent to AWS, but it has $300 of credit for new users. I don’t need ngrok anymore; that was just for testing.

I’m not sure I fully understand what hooks you have in mind, but I’m thinking something along these lines:

  • Server has its own copy of the code in the git repo, which is kept in sync with GitHub
  • Data is ignored by git

I think that, in many cases, the data could live exclusively on the remote server (no local copy, no duplication) since we can execute any code snippets that need access to the data on the remote (likely, they’re expensive anyway) while developing the code locally.
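
Concretely, a data-touching cell like the sketch below (the path is entirely hypothetical) would simply be sent to the remote kernel, where the files actually live, so no local copy is ever needed:

```python
# %% runs on the remote kernel, where the data lives; no local copy needed
from pathlib import Path

DATA_DIR = Path("/data/dogscats")  # hypothetical remote-only location
print(sorted(p.name for p in DATA_DIR.iterdir())[:5])
```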

I see. I am a heavy AWS user myself. I also recently set everything up on Google Cloud Platform (GCP) as the AI Saturdays global community will be using GCP.

If you are using GitHub to host your git repo, GitHub provides a feature known as webhooks. It’s more of a software developer thing.

Webhooks provide a way for notifications to be delivered to an external web server whenever certain actions occur on a repository or organization. Webhooks can be triggered whenever a variety of actions are performed on a repository. For example, you can configure a webhook to execute whenever a repository is pushed to, etc. You can make these webhooks trigger CI builds, update a backup mirror, or even deploy to your server.

I am trying to automate the process as much as possible, with fewer manual steps. I mean, all I need to do is git push to GitHub, and GitHub will then deploy my changes to my AWS/GCP server without the need to remote into the server and manually do a git pull to sync with GitHub. Doing this helps us stay focused on writing code locally in the Jupyter plugin.
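
Just to sketch what the receiving end of such a webhook might look like (this is my guess at a setup, not something you described): a tiny HTTP endpoint on the server listens for GitHub’s push event and runs git pull on the server’s copy of the repo. The repo path and port are hypothetical, and a real deployment should also verify GitHub’s X-Hub-Signature header before acting.

```python
# webhook_listener.py -- minimal sketch of a push-to-deploy endpoint
# (hypothetical path/port; add signature verification before using for real)
import subprocess

from flask import Flask, request

app = Flask(__name__)
REPO_PATH = "/home/ubuntu/ml-project"  # hypothetical location of the repo on the server


@app.route("/github-webhook", methods=["POST"])
def github_webhook():
    # GitHub sends the event name in this header; we only care about pushes
    if request.headers.get("X-GitHub-Event") == "push":
        subprocess.run(["git", "-C", REPO_PATH, "pull", "--ff-only"], check=True)
        return "pulled", 200
    return "ignored", 200


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Point the repository’s webhook payload URL at this endpoint and it fires on every push.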

The data live exclusively on the remote server. As a standard practice, we don’t commit the large data files into the git repo.

Ah, cool. I think we’re thinking largely the same things.

I’m somewhat familiar with webhooks but not with Docker. I was confused by the part about moving the data around containers and wanted to be sure to mention it could be remote-only.

I was also uncertain as to how git activity and the need for remote uptime were necessarily correlated, since that depends a bit on one’s git habits. However, thinking about it a little more, I’m seeing value in something like:

  • Start the remote with a shell script or similar. Throw in a git pull locally at the end, in case you previously did some work directly on the remote.
  • Establish the webhook you brought up to deploy the changes to the remote whenever you git push to GitHub. Afterward, stop the remote as part of the same operation (so you don’t forget and rack up charges).

This should work if you only push changes when you’re ready to take a break. The only manual steps to incorporate the remote would be ./start-ml-server.sh and git push.
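
As a rough sketch of what ./start-ml-server.sh could do, written in Python for consistency with the rest of the thread (the instance name and zone are made up, and the equivalent AWS CLI calls would work the same way):

```python
# start_ml_server.py -- rough equivalent of ./start-ml-server.sh
# assumes the gcloud CLI is installed and authenticated; instance/zone are hypothetical
import subprocess

INSTANCE = "ml-box"
ZONE = "us-west1-b"

# boot the remote GPU instance
subprocess.run(
    ["gcloud", "compute", "instances", "start", INSTANCE, "--zone", ZONE],
    check=True,
)

# pull locally in case work was done directly on the remote last session
subprocess.run(["git", "pull"], check=True)
```

The matching gcloud compute instances stop call would then live at the end of the webhook’s deploy step, per the second bullet above.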

Of course, you might already have had something better in mind that I’m missing. :slight_smile:


Hey @trevor,
I prepared a similar setup to yours (Atom + Hydrogen + GCP).
I managed to connect Hydrogen to the remote kernel (Hydrogen: Connect to Remote Kernel) and run code on it.
But I don’t know how to access the data that is stored on my gcloud machine.
I only manage to create paths to local files; I don’t know how to create paths to remote files.
Maybe connecting to the kernel is not enough?
Would you have any clues?

Hello,

The reason the number of samples stays the same is that the augmentation is done live during training. Imagine a black box that randomly augments a single image depending on the parameters you specified. This box is added to your pipeline. So every single image is randomly changed each time it comes through, according to your augmentation settings, instead of every image being added to your dataset in 20 different variants.
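
To make the “black box in the pipeline” idea concrete, here is a small sketch using torchvision-style transforms (just one possible library; the dataset paths and parameters are placeholders). The dataset length never changes, but every fetch applies fresh random transforms:

```python
# on-the-fly augmentation sketch: dataset size stays the same,
# but each access applies fresh random transforms
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])


class AugmentedImages(Dataset):
    def __init__(self, image_paths, labels):
        self.image_paths = image_paths  # hypothetical list of file paths
        self.labels = labels

    def __len__(self):
        # the number of samples is unchanged by augmentation
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("RGB")
        # a different random flip/rotation every time this index is requested
        return augment(img), self.labels[idx]
```

Training over this dataset for 20 epochs therefore shows each image in roughly 20 different random variants without the dataset itself ever growing.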

thanks
alexsunny

Thanks, my issue has been fixed.
