AWS GPU install script and public AMI

I checked on the PyTorch Slack and got feedback that everything should be working fine for CUDA 9, cuDNN, and the V100. What problems are you seeing?

Everything compiles and works, but there are a lot of deprecation warnings cluttering the fastai output (not a big deal at this point; they can be disabled or we could fix them). The real problem is that, for reasons unknown to me, training the initial model in lesson 1 takes around twice as long as when using the conda CUDA 8 setup with the accompanying binaries.

Maybe it is not CUDA 9 but the cuDNN 7.0.3 I was using.

On the PyTorch forum there is some info about work in progress on PyTorch for the V100 with CUDA 9, although there is definitely no need to use that version for the course.


I am thinking that maybe the issue is that we are not using cuDNN (I didn't check this), since the slowness, I think, might be in the convolutional part.
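For anyone who wants to check this, PyTorch itself reports whether cuDNN was picked up and which version it is. A quick sketch from the shell (assumes you run it in the same environment you launch the notebooks from):

$ python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('cuDNN enabled:', torch.backends.cudnn.enabled); print('cuDNN version:', torch.backends.cudnn.version())"
# if the build never found cuDNN, the version typically comes back as None
# and convolutions fall back to slower kernels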

Anyhow, the point you make is exactly what I was thinking :slight_smile: I could continue to sink time into getting this to work, but it is better spent on the actual course work :wink: So neat that the PyTorch devs provide us with all the necessary binaries for running this via conda with CUDA 8 :slight_smile:
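For reference, pulling in those prebuilt binaries is a one-liner along these lines — the exact channel and package names (pytorch vs soumith, cuda80) depend on when you install, so double-check the official PyTorch install instructions:

# install PyTorch built against CUDA 8 from the official conda channel (sketch)
$ conda install pytorch torchvision cuda80 -c pytorch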


@radek Actually I was running the lessons with CUDA 9/cuDNN 7 on a p3.2xlarge. Apart from the deprecation warnings I did not notice any particular slowness. But I haven't had a chance to compare it with CUDA 8 yet. Where (in which code) have you noticed the biggest slowness? I am curious to try the same and compare timings.

I noticed a difference when someone posted a screenshot to the forums with the timings of precomputing the activations for the first model. I think the difference was that on CUDA 9/cuDNN 7 the training took twice as long. I then went ahead and switched to the CUDA 8 / PyTorch binaries combo and got the same results as the person in the screenshot.

I think it was this screenshot that was helpful :slight_smile:

Yeah, I've got the V100 doing slightly better on a longer run, but it's still not worth the money.

this is p2.xlarge:
[screenshot of timings]

this is p3.2xlarge:
[screenshot of timings]

Hi, NVIDIA has built specialized software for leveraging the computing power of AWS EC2 P3 instances.
This is from the message they sent me, as I have an account on NVIDIA Developer:

You can now access performance-engineered deep learning frameworks with NVIDIA GPU Cloud (NGC) and the newly announced Amazon EC2 P3 instances with NVIDIA Tesla V100 GPUs. With Tesla V100s, you can train over 10X faster than Kepler GPUs.

For more information, follow the link below and create a free account to use the tools:
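In practice, once you have the account and an NGC API key, the NGC images are pulled like any other Docker image. A rough sketch — <YOUR_NGC_API_KEY> and <tag> are placeholders, and the registry username really is the literal string $oauthtoken:

# log in to the NGC registry and pull the PyTorch image
$ docker login nvcr.io -u '$oauthtoken' -p <YOUR_NGC_API_KEY>
$ docker pull nvcr.io/nvidia/pytorch:<tag>
# run it with GPU access (nvidia-docker, or --runtime=nvidia on newer setups)
$ nvidia-docker run -it --rm nvcr.io/nvidia/pytorch:<tag>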

Hi @radek,
could you possibly write a post about how to use your AMI with spot instances, as a step-by-step process like Jeremy did in the lesson 2 wiki (not fully detailed, but at least the steps): from choosing the AMI, to setting the launch configuration if applicable, to the final setup for Jupyter and so on…

Thank you so much.

Hi @iskode - please find setup instructions in the original post of this thread under the last section (Install script with AWS Spot instances with persistent storage).

I just updated the howto and this is what the end result looks like now:

I think initially it would be a good idea to follow the howto step by step, but once you start getting the hang of how AWS works it is very straightforward to replace the AMI that you build in the howto with the official one from @jeremy. If you need help with that please let me know. You can find the list of the official AMIs broken down by region in the lesson 2 wiki.
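To illustrate the swap: the only thing that changes in the launch step is the image id. A hedged sketch with the AWS CLI — every id below is a placeholder, so substitute the official AMI id for your region from the lesson 2 wiki plus your own key, security group, and subnet:

# launch a spot-priced instance from a given AMI (sketch)
$ aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type p2.xlarge \
    --key-name aws-key --security-group-ids sg-xxxxxxxx --subnet-id subnet-xxxxxxxx \
    --instance-market-options 'MarketType=spot'
# the spot-request flow in the howto works the same way: just point its
# launch specification at the new ImageId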

Is there a way to know how much of the free credit I have used on Amazon EC2?

You can check it under your account > credits. I am not sure how reliable it is though.


Hi @radek,
I'm on the point of switching from Reserved Instances to Spot Instances (yeah, a bit late!).
Please, could you possibly show me a screenshot of your bill for November, to get
a clear idea of the cost and compare it to mine (below)?

Thank you so much.

hey @iskode!

Please take a look below (this is for November)

Thank you so much for your prompt reply.
Wow, it's really economical to run spot instances: 3 times cheaper with more compute time, 76 hrs vs 59 hrs for mine. And even the storage cost is nearly the same as with Reserved Instances.

For the longest time I only kept a 20 GB SSD drive, essentially a workspace. At 10 cents / GB per month that ends up being just $2 :slight_smile:

Another nice benefit here is that once you set everything up, it becomes very quick and easy to switch between instance types. For example, I am now considering learning embeddings on my local machine and uploading them to AWS to train random forests / xgboost classifiers and to take advantage of those crazily beefed up CPU instances.

Not sure if / when I will get a chance to get around to this because of time constraints, but the idea has quite a bit of appeal to me :slight_smile:

So many awesome things from p1 v2 / ML course I still haven’t had a chance to play around with!!!

Please help me with this…
I'm just stuck at this point.

$ bash setup_p2.sh
rtbassoc-87ceaefc

An error occurred (InvalidID) when calling the CreateRoute operation: The ID 'rt ' is not valid

An error occurred (InvalidGroupId.Malformed) when calling the AuthorizeSecurityGroupIngress operation: Invalid id: "sg-ad7e33d1"

An error occurred (InvalidGroupId.Malformed) when calling the AuthorizeSecurityGroupIngress operation: Invalid id: "sg-ad7e33d1"
setup_p2.sh: line 13: /home/welcome/.ssh/aws-key.pem: No such file or directory
chmod: cannot access '/home/welcome/.ssh/aws-key.pem': No such file or directory

An error occurred (InvalidKeyPair.NotFound) when calling the RunInstances operation: The key pair 'aws-key' does not exist
Waiting for instance start…
Waiter InstanceRunning failed: Max attempts exceeded
usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help
aws.exe: error: argument --instance-id: expected one argument
securityGroupId=sg-ad7e33d1
subnetId=subnet-6ad8f822
instanceId=
instanceUrl=None
Connect: ssh -i /home/welcome/.ssh/aws-key.pem ubuntu@None
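Looking at that output, the script is failing at the very first steps (the key pair was never created and the security group id is being rejected), so every later command just gets empty values; one common cause when running these scripts from Windows (note the aws.exe in the log) is CRLF line endings adding an invisible carriage return to the ids. If you still want to debug the from-scratch route, a starting point along these lines might help — aws-key is just the name the script expects, and the path matches the one in the error messages:

$ aws configure list                              # are credentials and a default region set?
$ aws ec2 describe-key-pairs --key-names aws-key  # does the key pair exist in this region?
# if it does not, create it and save the private key where the script looks for it
$ aws ec2 create-key-pair --key-name aws-key --query 'KeyMaterial' --output text > ~/.ssh/aws-key.pem
$ chmod 400 ~/.ssh/aws-key.pem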

@Rishit you should simply use the fast.ai AMI, rather than trying to set up from scratch.


Hi Radek, I've been working my way through the instructions using the fastai AMI instead of installing everything from scratch. But now I'm worried about persistence, since it only covers the workspace directory. What if I update a package or conda — will that be saved? If not, how do I persist the system state? This seems like a big problem, because at each spot request the system is restored as it was in the AMI image. How do you deal with such a situation?
Thank you so much.

If a package is updated, the corresponding changes will be made by someone in the notebooks, and moreover we ourselves can track them, since the notebooks serve as a reference (for how to play with the datasets).
Also, upgrading packages always keeps backward compatibility…
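If you do want your own package or conda updates to survive a new spot request, one approach is to keep a snapshot of the conda environment on the persistent volume and re-apply it when a fresh instance comes up. A sketch, assuming the persistent volume is mounted in your home directory (here called ~/workspace, as in the howto) and that you run this inside the activated course environment:

# after updating packages, snapshot the environment onto the persistent volume
$ conda env export > ~/workspace/environment.yml
# on a freshly started spot instance, bring the environment back up to date
$ conda env update --file ~/workspace/environment.yml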