AWS Deep Learning sandbox with EFS and Spot instance support

Hi,

The main idea behind this AWS CloudFormation template is to stay within the free tier whenever possible, while still being able to quickly provision and switch between frameworks and between on-demand and spot instances without copying datasets and notebooks.

It also allows running several different instance types simultaneously while sharing the same data and notebooks on Elastic File System (EFS).

When the stack is deleted, it removes all resources it created in the AWS account except the EFS file system. That lets you keep your data and notebooks and attach them to any other instance, stack, or Docker container.

Elastic File System is an NFS-like file system and is more flexible than EBS from a pricing perspective: you pay for the space you actually use rather than the space you allocate, and it grows and shrinks automatically.

There are two choices in the template:
ubuntuBootstrapAMI - uses an Ubuntu 16.04 image and bootstraps it into a full DL environment, launching Jupyter at the end.
The EBS root volume is 16 GB, and the fast.ai notebooks are downloaded into /home/ubuntu/efs.

Since bootstrapping uses a modified version of the install-gpu.sh script, it can run the fast.ai notebooks out of the box.

awsDeepLearningAMI - uses the Amazon Linux image with the June release of the AWS Deep Learning AMI.

Both choices mount the EFS under the ~/efs path.
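For reference, mounting an EFS file system by hand looks roughly like this; the file-system ID and region are placeholders, and the exact mount options the bootstrap script uses may differ:

    sudo mkdir -p /home/ubuntu/efs
    sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
        fs-xxxxxxxx.efs.us-east-1.amazonaws.com:/ /home/ubuntu/efs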

In order to create the stack you need at least one EC2 Key Pair.

To start, download the CloudFormation template deeplearning-sandbox-cfn.json.

In your AWS account, go to Services -> CloudFormation and click Create a Stack.

Then browse to and upload the deeplearning-sandbox-cfn.json template:

After filling in the parameters and clicking through the confirmation screens, creation will begin.
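If you prefer the CLI, the console steps above can be approximated with aws cloudformation create-stack; the stack name and parameter keys below are illustrative, so check the template for the real parameter names:

    aws cloudformation create-stack \
        --stack-name dl-sandbox \
        --template-body file://deeplearning-sandbox-cfn.json \
        --parameters ParameterKey=KeyName,ParameterValue=my-key-pair \
                     ParameterKey=InstanceType,ParameterValue=t2.micro \
        --capabilities CAPABILITY_IAM   # only needed if the template creates IAM resources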

It takes approximately 26 minutes to bootstrap the instance on a t2.micro. After that you can stop and start the instance through the EC2 console or the AWS CLI.
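Stopping and starting through the CLI is just (instance ID is a placeholder):

    aws ec2 stop-instances --instance-ids i-0123456789abcdef0
    aws ec2 start-instances --instance-ids i-0123456789abcdef0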

After creation and bootstrapping are complete you will see the status CREATE_COMPLETE.
Note that when using spot instances, CREATE_COMPLETE appears before bootstrapping is finished.
You can follow the bootstrapping process by tailing /var/log/bootstrap.log

I would recommend performing the initial creation on a t2.micro and, once you have tried running the notebooks, creating an AMI from the bootstrapped instance. That AMI can later be used as the Custom AMI parameter to start spot and/or p2 instances.
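Creating the AMI can be done from the EC2 console, or with something like this (instance ID, name, and description are placeholders):

    # Snapshot the bootstrapped instance into a reusable AMI
    aws ec2 create-image \
        --instance-id i-0123456789abcdef0 \
        --name fastai-bootstrapped \
        --description "Ubuntu 16.04 DL environment bootstrapped by deeplearning-sandbox-cfn"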


If you chose CREATE-NEW-EFS, you can manage your file system under Services -> EFS.

P.S. Big thanks to @jeremy and @rachel for running the fast.ai project!


Thanks - that’s really interesting! And really appreciate the clear explanation :slight_smile:

If during stack creation you see messages like this:

Most likely it means that the bootstrap script did not complete successfully within 1 hour 15 minutes.
The template waits for 1h15m and automatically rolls back if it does not receive a success signal from the script.
You can modify this timeout after you have downloaded the CFN template, but before you launch it:
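A quick way to find where the timeout lives in the downloaded template is simply to search for it; depending on how the wait is implemented, it is either a WaitCondition Timeout (in seconds) or a CreationPolicy ResourceSignal Timeout (an ISO 8601 duration):

    # Locate the timeout setting in the template, then edit the value before uploading it
    grep -n -i "timeout" deeplearning-sandbox-cfn.json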

Generally, if creation takes more than 30 minutes (on a t2.micro, fast.ai image bootstrapping takes ~25 minutes), you can log in to the instance and monitor what the bootstrap script is doing: tail -f /var/log/bootstrap.log

Added/fixed the Jupyter URL output once the stack is complete (it is based on the public DNS name of your newly created EC2 instance):
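The stack output prints this for you, but you can also reconstruct the URL yourself; the instance ID is a placeholder and this assumes Jupyter listens on its default port 8888:

    PUBLIC_DNS=$(aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
        --query 'Reservations[0].Instances[0].PublicDnsName' --output text)
    echo "http://${PUBLIC_DNS}:8888"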

How I personally used this template:

  1. Created the initial fast.ai instance using the ubuntuBootstrapAMI deep learning image option and t2.micro as the instance type.
  2. It also created a new EFS file system and added it to /etc/fstab so it is mounted on the next startup (see the fstab sketch after this list).
  3. Once I had tested @jeremy's notebooks, I created an AMI from this instance. Since all data and notebooks are under /home/ubuntu/efs, the EBS root volume is only 16 GB, so it fits within the AWS free tier.
  4. Then I use these new AMIs to launch p2 spot instances for larger workloads; since the mount is already in /etc/fstab, EFS with my data and notebooks is mounted during startup. When the spot instances are destroyed, all data remains on EFS.
  5. The same approach works for starting other stacks and AMIs.
  6. Most important for me: I don't have to allocate hundreds of GB of EBS volumes just to occasionally run a model on a large dataset; EFS grows and shrinks automatically based on the amount of data on it.
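The fstab entry added by the bootstrap looks roughly like this (file-system ID and region are placeholders, and your mount options may differ):

    # /etc/fstab - EFS mounted over NFSv4.1; _netdev delays the mount until networking is up
    fs-xxxxxxxx.efs.us-east-1.amazonaws.com:/  /home/ubuntu/efs  nfs4  nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,_netdev  0  0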

P.S. There is also a more "production-ready cluster" CFN template created by awslabs: https://github.com/awslabs/deeplearning-cfn


hi @tensoralex,

I was able to set up the Ubuntu t2.micro AMI, but when I create spot instances from it, I notice that file changes are not saved to the EFS. It seems like only the changes made prior to the AMI are saved in the EFS and all subsequent changes are lost. It's the equivalent of the EFS not being persistent across instances (or even on the original on-demand instance).

Not sure what I’m doing wrong. Any help would be appreciated. Thanks

hi @Salim

It appears that your EFS is not getting mounted when you start a spot instance from the AMI.
You should see the EFS mount in the output of df -hP or mount; you can also check whether the automount (cat /etc/fstab) is configured correctly.

If your initial t2.micro has EFS mounted (which it should),
then the most plausible reason is that you are launching the spot instance in a different Availability Zone than the one the original CloudFormation template created.

To work around that, add "Mount targets" to your EFS file system for all your AZs:
Go to EFS:

Pick your file system and click Manage file system access:

Add the Availability Zones in which you are going to start spot instances, or simply all AZs:
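The same thing can be done from the CLI; the file-system, subnet, and security-group IDs below are placeholders (repeat the call with one subnet per AZ):

    aws efs create-mount-target \
        --file-system-id fs-xxxxxxxx \
        --subnet-id subnet-xxxxxxxx \
        --security-groups sg-xxxxxxxx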

hi @tensoralex …thanks for getting back so quickly!

I think the problem might be that the EFS is not mounted on the original t2.micro. When I run df -hP I get the following results.

When I run the mount command directly from the command line, I get a "mount.nfs4: Connection timed out" error. I've made sure that the security groups are correct on the mount target and the original instance, but to no avail.

Any ideas? Thanks!

@Salim try pinging the EFS mount point from the EC2 instance (i.e. ping fs-xxxxxx.efs.us-east-1.amazonaws.com) - does it resolve to an IP?
If not,
check your VPC settings: is DNS resolution enabled?
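A quick way to check (VPC ID is a placeholder); both attributes should come back true for the EFS DNS name to resolve inside the VPC:

    aws ec2 describe-vpc-attribute --vpc-id vpc-xxxxxxxx --attribute enableDnsSupport
    aws ec2 describe-vpc-attribute --vpc-id vpc-xxxxxxxx --attribute enableDnsHostnames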

@tensoralex couldn’t ping it but the DNS resolution IS enabled. odd…

@Salim make sure the mount target is created in the same AZ and subnet where your t2.micro is running.
Try mounting it using the EFS IP instead of the DNS name.
Try creating a brand new stack from scratch; if the issue still persists, PM me the parameters and VPC configuration you used.
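Mounting by IP would look roughly like this; the mount-target IP comes from describe-mount-targets, and the file-system ID and address below are placeholders:

    # Find the mount target IP for your AZ, then mount by IP instead of DNS
    aws efs describe-mount-targets --file-system-id fs-xxxxxxxx
    sudo mount -t nfs4 -o nfsvers=4.1 10.0.1.25:/ /home/ubuntu/efs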

@tensoralex your suggestion of opening up the NFS port within the security group worked. Thanks!
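For anyone hitting the same issue, the fix amounts to allowing inbound NFS (TCP 2049) on the EFS mount target's security group from the instance's security group; the group IDs below are placeholders:

    # sg-11111111 = EFS mount target's security group, sg-22222222 = EC2 instance's security group
    aws ec2 authorize-security-group-ingress \
        --group-id sg-11111111 \
        --protocol tcp --port 2049 \
        --source-group sg-22222222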

Is it feasible to run deep learning training directly out of EFS or is there a substantial performance hit?

If your dataset has thousands of small files, then yes, EFS will be noticeably slower than an EBS volume.
For large files, EFS burst performance is comparable to smaller GP2 volumes (<100 GB).

I eventually ended up rsyncing datasets from EFS to the local EBS volume before running training.
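The rsync step is nothing fancy; the paths here are just examples:

    # Copy a dataset from the shared EFS mount to the faster local EBS volume before training
    mkdir -p ~/data
    rsync -ah --progress ~/efs/data/dogscats/ ~/data/dogscats/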


Hi @tensoralex,

I wanted to check whether you have evolved this CloudFormation template at all since you last updated it. I am interested in using it or something similar.

I hope you or anyone else can help answer my questions:

When you spin up a new stack instance, how do you bring in new GPU resources? Are they part of the CloudFormation stack, or do you add them after the instance is created? Or do you just shut the stack down and restart with new workers?

I guess what I want is an AWS t2.micro master started from the AMI, to which I can then manually (or automatically, up to a limit, e.g. 2 p2.xlarge) add GPU workers and split the PyTorch task across them.

Is this possible? Is this how your CF stack works?

Some background on where I am at:

I tried some CloudFormation templates but never settled on one, as I couldn't get persistence of my setup or a cheap system that can pull in GPUs as needed - hence the questions above.

I have been using EC2 spot instances and the ec2-spotter scripts to reduce my costs, but I have been wanting a setup that allows me to spin up GPUs as needed.

There is also the news about Elastic GPUs, but I don't think they are ready for deep learning apps:
https://aws.amazon.com/ec2/elastic-gpus/

Thanks
Nick

@kadlugan Haven’t updated it for a while.
for experiments/development I switched to local DL box.

And there is a more mature CFN template from AWS for running clusters: https://github.com/awslabs/deeplearning-cfn

But really, after SageMaker was released, even that is a little obsolete.

The idea is that you create and destroy this CFN stack (with its instance), but since the data is on EFS, it persists. You just provide the same EFS ID the next time you spin up the stack (and at that point you can change the instance type).

This is really what SageMaker can do nowadays.

Yes, Elastic GPUs are not suitable for DL.

Based on what you described, take a look at AWS SageMaker.
I was working on a pytorch-fastai Docker image for SageMaker, but haven't had a chance to integrate it with SageMaker.

Hi @tensoralex, I’m looking into Sagemaker and using fastai library with it. It looks great for getting the model deployed. The recent update adds PyTorch support, I was wondering is it still necessary to use Docker for getting fastai into Sagemaker?

SageMaker uses Docker images under the hood for both training and inference. They just released new images to support popular frameworks, so those are available out of the box.
So while SageMaker supports PyTorch now, in order to run the fastai lessons you would have to deploy the fastai libraries inside the image, or modify the fastai code so that it makes calls to the inference images from within the code (actually I am not sure whether that is even feasible).
So I think that yes, you still need custom Docker images to run fastai on SageMaker.
In practice, for any custom project that is not MXNet or TensorFlow based, you would probably have to customize the SageMaker images.
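If you go the custom-image route, the rough workflow is building the image and pushing it to ECR so SageMaker can pull it; the repository name, region, and account ID below are placeholders:

    # Build a custom image (your Dockerfile would install fastai and its dependencies)
    docker build -t fastai-sagemaker .

    # Create an ECR repository and push the image so SageMaker can use it
    aws ecr create-repository --repository-name fastai-sagemaker
    $(aws ecr get-login --no-include-email --region us-east-1)   # CLI v1; newer CLIs use get-login-password
    docker tag fastai-sagemaker:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/fastai-sagemaker:latest
    docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/fastai-sagemaker:latest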