My experience keeping it on the AWS free tier

Hi all,

I’m fairly new to AWS, and I’m not a programmer by trade. I initially followed the getting-started instructions for creating an EC2 instance from the 128GB AMI, but being cost conscious, I wanted to find ways to optimize and stay on the AWS free tier for the first 12 months. I thought I would write out the rough steps I took, in case it is helpful to any other newbies.

Getting going from “scratch”

First things first, it is easy to create a smaller EC2 instance and get it running. Thanks to the install-gpu.sh shell script, much of the install is taken care of for you. Like I mentioned, I first went with the standard 128GB AMI provided by Jeremy and Rachel. Then I found a great thread from vshets on reducing the size of the volume. His AMI used the full 30GB of free-tier-eligible EBS, and didn’t really give me a practical way to manage things.

So, using the AWS portal, I created a new, slightly smaller EC2 instance. For this pass, I went with a 24GB volume. I have some thoughts on lowering that, which I’ll get to shortly. After logging in and creating my desired directory structure, I ran the install-gpu.sh script.
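(For anyone who prefers the CLI over the portal, the same launch can be scripted. This is only a rough sketch, and the AMI ID, key name and security group are placeholders for your own values.)

# Launch a p2.xlarge with a 24GB gp2 root volume instead of the default size
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type p2.xlarge \
  --key-name my-key \
  --security-group-ids sg-xxxxxxxx \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":24,"VolumeType":"gp2"}}]'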

I found the install script generally worked, but it didn’t add the nvcc path to my ~/.bashrc, so I had to add it manually. Adding the following two lines to ~/.bashrc did the trick:

export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
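A quick way to confirm the path is picked up after that:

source ~/.bashrc
nvcc --version    # should report the CUDA 8.0 compiler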

Oh, and for whatever reason, jupyter notebook wasn’t recognized, so I added this line to ~/.bashrc as well:

export PATH="/home/ubuntu/anaconda2/bin:$PATH"

I also had to downgrade keras to version 1.2.2, as mentioned in another thread.

sudo pip install keras==1.2.2
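A quick sanity check that the downgrade took:

python -c "import keras; print(keras.__version__)"    # should print 1.2.2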

At this point, things worked and I was able to get through lesson 1. I’m currently running my first full pass at the State Farm distracted driving Kaggle contest.

Next Steps

I think I want to shake things up again and create an EC2 instance with an even smaller EBS volume, just for the OS and applications. Then I can mount a secondary EBS volume as a data store. This will let me keep a “pure” OS volume, which I can back up as an AMI and use to get up and running with all my applications and settings exactly how I like them. By mounting a second EBS volume as a data directory, I can keep the big data hogs (like the 4+ GB State Farm data set) outside of my AMI. If I want to back one of those up for any reason, I can create an EBS snapshot, but since bandwidth is relatively cheap and you can keep a record of the commands you ran to post-process a data set, I’m thinking I might just treat the data volumes as fairly disposable.
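For the data volume part, what I have in mind is roughly the usual attach, format and mount routine. I haven’t tested this on my setup yet, and it assumes the new volume shows up as /dev/xvdf (the device name can differ), so treat it as a sketch:

# Check how the attached volume is named, then format and mount it as a data drive
lsblk
sudo mkfs -t ext4 /dev/xvdf        # only do this on a brand-new, empty volume!
sudo mkdir -p /data
sudo mount /dev/xvdf /data
sudo chown ubuntu:ubuntu /data
# To remount automatically on reboot, add something like this to /etc/fstab:
# /dev/xvdf  /data  ext4  defaults,nofail  0  2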

One reason this data-volume-vs-OS-volume approach is interesting to me is that AWS charges based on GB-months of provisioned EBS, and I believe that is based on the full provisioned size, not just the space actually used. So having a 128GB volume provisioned indefinitely would be much more expensive than having a small OS volume provisioned and adding a data volume as needed for that week’s course work. You could then either save those data volumes as snapshots and delete them, or just keep them and replace the data for the next project.
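As a rough back-of-the-envelope (assuming gp2 at about $0.10 per GB-month and the 30GB free-tier allowance; check the current pricing page for your region):

128GB volume: 128 - 30 free = 98GB billed, roughly 98 x $0.10 ≈ $9.80/month, whether I touch it or not
24GB OS volume: fits entirely inside the 30GB allowance, so effectively $0 for the first 12 months, though it leaves only a few GB of headroom for a data volume, which is one reason I want the OS volume even smaller.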

If I end up going this route of making an even smaller OS+apps AMI, I’ll follow up on this post with more info on my approach and how it’s working for me. If you have done anything similar, please give me feedback.

Thanks,
Zach


So last night I got my State Farm distracted driving data set prepared and kicked off a full pass on a p2.xlarge instance around 11 PM. Jupyter notebook estimated one epoch would take about 7000 seconds (~2 hours) to complete. I had to go to sleep, so I left the p2.xlarge running all night, which cost me a few bucks in unnecessary charges.

First, does it seem correct that one epoch on a p2.xlarge instance with this data set would take nearly 2 hours to run?

Second, if so, what can I do to automatically stop the instance once it is done, so I’m not charged for unnecessary hours? I was thinking maybe a cron job that runs an AWS CLI command to stop the instance at a set time, but I wondered if there were more interesting ways.
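Something like this is what I had in mind for the cron route (untested; it assumes the AWS CLI is installed and configured with credentials allowed to call ec2:StopInstances, and the region would need to match your setup):

# stop-me.sh: stop this instance; the instance ID comes from the metadata service
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 stop-instances --instance-ids "$INSTANCE_ID" --region us-east-1
# then schedule it for, say, 2 AM with `crontab -e`:
# 0 2 * * * /home/ubuntu/stop-me.sh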

Here are the results of the epoch with the processing time - it looks like it took almost 30 minutes longer than the original estimate of 7000 seconds.

Found 21174 images belonging to 10 classes.
Found 250 images belonging to 10 classes.
Epoch 1/1
21174/21174 [==============================] - 8699s - loss: 1.8119 - acc: 0.5235 - val_loss: 0.6046 - val_acc: 0.7760

You may be interested in my solution, which uses Docker. This means you can store both data and program settings in a cheap snapshot. The AMI is minimal and created only when you need it. It can also run as a spot instance, so you can get a GPU instance for around 20c per hour.

https://github.com/simonm3/xdrive


Thanks for the link, I will have to read through this. I have extremely limited experience with Docker, but I wondered whether containerizing things was an approach worth looking into. I don’t know enough about AWS yet to really understand the spot instance stuff either, so I have some reading to do.

As I understand it, I can say, “Hey, I want to run this thing on this type of instance, and I’m willing to pay up to $0.20/hr.” As long as the spot price is lower, my thing runs; if it’s higher, my thing doesn’t run. Is that about it?

I knew nothing about AWS or Docker when I started this course.

If you want to do deep learning then the best solution is to build your own GPU machine, which is the subject of the most popular thread on this forum; it will likely be around twice as fast as the AWS ones. For now I am using AWS spot instances. If you bid too low you just don’t get an instance. I set xdrive to bid 25c; so far I am paying 20c on average and have never had an instance terminated. I have heard the price can sometimes spike to a few dollars, so really the bid is just there to make sure you are not caught out by that.
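If you want to see the bare mechanics outside of xdrive, a plain spot request from the AWS CLI looks roughly like this (the AMI ID and key name are placeholders):

# Request a p2.xlarge spot instance with a 25c bid
aws ec2 request-spot-instances \
  --spot-price "0.25" \
  --instance-count 1 \
  --launch-specification '{"ImageId":"ami-xxxxxxxx","InstanceType":"p2.xlarge","KeyName":"my-key"}'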

Docker is not essential for this course but is worth learning. It allows the xdrive container to run on any operating system in exactly the same way, with no dependencies apart from Docker itself. It also makes it easy to create a clean Linux machine from within Windows and run some software on it.

I set up CloudWatch alarms that shut down my instance when it is not in use. It takes just a few minutes on the web console. I set mine to shut down when network traffic is very low for more than 2 hours; SSHing into the instance generates network activity, so that works as a proxy for usage. You can disable or extend the alarm if you have a long training run.

https://aws.amazon.com/about-aws/whats-new/2013/01/08/use-amazon-cloudwatch-to-detect-and-shut-down-unused-amazon-ec2-instances/
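If you would rather script it than click through the console, roughly the same alarm can be created with the CLI. The instance ID, region and thresholds below are placeholders, so tune them to your own idea of “idle”:

# Stop the instance when average NetworkOut stays under ~50KB for two consecutive 1-hour periods
aws cloudwatch put-metric-alarm \
  --alarm-name auto-stop-idle \
  --namespace AWS/EC2 \
  --metric-name NetworkOut \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 3600 \
  --evaluation-periods 2 \
  --threshold 50000 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:stop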
