AWS GPU install script and public AMI

Sorry I wasn’t clear! If you encounter anything else that seems off or come across any other issues, please give me a shout and we will try to figure it out :slight_smile:

Kudos to @radek!
I am testing your script on p3.2xlarge and it looks like torch uses only CPUs.

Did it use GPU on p2.xlarge?

Yes, it did, on p2.xlarge. I am not sure what the issue is - Keras seems to work okay with the TensorFlow backend on p3.2xlarge.

I wonder what could be happening - I know very little about torch, so I most likely won’t be able to troubleshoot this. I tried running torch.cuda.is_available() on p3.2xlarge and it returned True.
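
A return value of True from torch.cuda.is_available() only says that a driver and a device are visible; it does not show that the installed build can actually run kernels on that card. Below is a minimal diagnostic sketch, assuming a PyTorch version that provides torch.version.cuda and torch.cuda.get_device_capability. The helper name cuda_report is made up for illustration and is not part of the install script.

```python
def cuda_report():
    """Collect basic facts about the local PyTorch/CUDA setup without raising."""
    report = {"torch_installed": False, "cuda_available": False}
    try:
        import torch
    except ImportError:
        return report  # nothing more to check without torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version the binary was built against (None for CPU-only builds).
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()

    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
        try:
            # Run a tiny computation on the GPU to see whether kernels execute.
            x = torch.ones(2, 2).cuda()
            report["gpu_op_ok"] = bool((x + x).sum().item() == 8.0)
        except RuntimeError as err:
            report["gpu_op_ok"] = False
            report["gpu_op_error"] = str(err)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If gpu_op_ok comes back False, or built_with_cuda is older than what the card needs, that would point at the PyTorch binary rather than at the driver.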

I guess I was hoping for a discussion like we are having right now as I am very new to all this as well :slight_smile: Thx for reporting the problem!

Maybe this can be traced to using CUDA 9 instead of CUDA 8. I will replace CUDA 9 with CUDA 8 at the latest tomorrow and will see if it changes anything for the better.

I believe CUDA 9 is required for V100/p3

@radek sure, I am curious to make it work on p3. I don’t think CUDA 8 will work with the V100 (for me, even nvidia-smi could not recognize the GPU with CUDA 8 on p3).

Will keep troubleshooting…

I did a bit more reading and I think you are right. The Tesla K80 is Kepler architecture (not even Pascal), whereas the V100 is Volta, which IIUC requires CUDA 9.
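
To make the relationship concrete, here is a small sketch that encodes the two instance types from this thread and the rule that Volta needs CUDA 9 or later. The table and the function name cuda_can_target are illustrative only, and the sketch considers just the two CUDA versions discussed here (8 and 9).

```python
# GPU, architecture and compute capability for the instance types in this thread.
INSTANCE_GPUS = {
    "p2.xlarge": ("Tesla K80", "Kepler", (3, 7)),
    "p3.2xlarge": ("Tesla V100", "Volta", (7, 0)),
}


def cuda_can_target(instance_type, cuda_major):
    """Return True if the given CUDA major version can target the instance's GPU.

    Only CUDA 8 and 9 are considered, since those are the versions under discussion.
    """
    _name, _arch, capability = INSTANCE_GPUS[instance_type]
    if capability >= (7, 0):
        # Volta (compute capability 7.0) is first supported by CUDA 9.
        return cuda_major >= 9
    # Kepler cards such as the K80 are handled by both CUDA 8 and CUDA 9.
    return cuda_major in (8, 9)


if __name__ == "__main__":
    for instance in INSTANCE_GPUS:
        for cuda in (8, 9):
            print(instance, "CUDA", cuda, "->", cuda_can_target(instance, cuda))
```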

I think the issue is the way I install PyTorch. My guess is that we need to build it from source rather than install it via conda, but I am not sure; I am reading up on this further.

According to the PyTorch discussion forums, you can’t even install from source for CUDA 9 yet. So you may have to wait a bit here…

I fixed the script and it now works both with p2 and p3 instances :slight_smile: Had to compile pytorch and torchvision.

Ran the entire lesson1 notebook with no problems on p3.2xlarge :slight_smile:

Changes to the install script pushed. Will create the AMI and update the original post as soon as it finishes building.

thanks @radek!

AMI created - the new AMI is ami-acac0ad5.

I had this issue:

A client error (InvalidParameterValue) occurred when calling the CreateSubnet operation: Value (eu-west-1a) for parameter availabilityZone is invalid. Subnets can currently only be created in the following availability zones: us-west-2a, us-west-2b, us-west-2c.
usage: aws [options] [parameters]
aws: error: argument --subnet-id: expected one argument

Also afterwards I got this issue:

A client error (InternetGatewayLimitExceeded) occurred when calling the CreateInternetGateway operation: The maximum number of internet gateways has been reached.
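
Reading the first error above, the script appears to request a subnet in eu-west-1a while the CLI is talking to us-west-2, and the missing --subnet-id argument looks like a knock-on effect of the subnet never being created. A minimal sketch of a pre-flight check follows, assuming standard zone names of the form region plus a single letter. The function names are hypothetical and not part of any script in this thread.

```python
import re


def zone_in_region(zone, region):
    """True if `zone` looks like a standard availability zone of `region`."""
    return re.fullmatch(re.escape(region) + r"[a-z]", zone) is not None


def check_zone(zone, region):
    """Fail early with a readable message instead of inside the AWS call."""
    if not zone_in_region(zone, region):
        raise ValueError(
            f"zone {zone!r} does not belong to region {region!r}; "
            f"use something like {region}a or change the configured region"
        )
    return zone


if __name__ == "__main__":
    print(zone_in_region("us-west-2a", "us-west-2"))
    print(zone_in_region("eu-west-1a", "us-west-2"))
```

Either changing the zone in the script or switching the configured region to eu-west-1 should make the two agree. The second error usually means that internet gateways from earlier runs are still around and count against the per-region limit.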

Oh dear :smiley: Having compiled pytorch on p3.2xlarge this now fails on a p2.xlarge :slight_smile:

Todo for tomorrow for myself:

  • see if the install script works on p2.xlarge
  • if yes, create a new AMI for use with p2.xlarge instances

At what point did you get these errors? Are you using my scripts or the scripts from part1 v1 of the course?

@radek

Thanks for posting these instructions. I’m looking forward to getting this set up.

In the first step to spin up the p2.xlarge instance, should I be using “setup_p2.sh” in courses/setup? I was getting an error like this:

(aws) kmatsuda12ctower:setup kmatsuda$ ssh -i /Users/kmatsuda/.ssh/aws-key-fast-ai.pem ubuntu@ec2-52-32-247-202.us-west-2.compute.amazonaws.com
Enter passphrase for key ‘/Users/kmatsuda/.ssh/aws-key-fast-ai.pem’:

Looking in the forums, I found this thread in which somebody ran into the same issue. Jeremy’s response in that thread was:
“Somehow you’ve ended up with a password protected key. Might be easiest to start over.”

I ran “fast-ai-remove.sh” and started over, but am still getting the same error. Should I still be following the instructions in the AWS setup video to get the p2 instance going?

Is anyone else running into this? I can wait for the new AMI, but I would like to understand what the problem is and how to resolve it (if anyone knows).
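
One way to narrow this down is to check whether the key file itself is marked as encrypted. The sketch below is a heuristic only: it looks for the markers used by traditional PEM and PKCS#8 encrypted keys, and it cannot tell for keys stored in the newer OpenSSH format. The function name pem_looks_encrypted is made up for illustration.

```python
def pem_looks_encrypted(pem_text):
    """Heuristic: True if the PEM text carries a known encryption marker."""
    markers = (
        "Proc-Type: 4,ENCRYPTED",                  # traditional encrypted PEM
        "-----BEGIN ENCRYPTED PRIVATE KEY-----",   # encrypted PKCS#8
    )
    return any(marker in pem_text for marker in markers)


if __name__ == "__main__":
    import sys

    # Usage: python check_key.py /path/to/key.pem
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            print(pem_looks_encrypted(handle.read()))
```

If this returns False and ssh still asks for a passphrase, the file itself may be worth inspecting, for example to see whether it is empty or incomplete.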

Thanks

I do not use those scripts so it is hard for me to comment. I use a slightly modified version that you can find here.

I have not visited this repo of mine in quite a while and forgot what was in the readme but this information seems like it might be useful :slight_smile:

In the request-spot-instance.sh on line #11 replace the ami with an ami of your choice:

export ami="ami-785db401" <- the default one is Ubuntu Server 16.04 LTS I believe.

The instructions in the original post in this thread assume you already have some way of doing the above (steps 1 and 2: spinning up an instance and SSHing into it), but now that I think of it, this might be useful for other people as well. Let me update the original post.

BTW, this assumes you have the AWS CLI configured.

Thanks @radek. I think I may have a combination of problems, but re-reading your post it sounds like I’m not in the correct region. When I run these scripts I get stuck with needing that AMI you reference which is in ‘eu’ and I’m in ‘us’. I may still run into the issue I had mentioned before, but I think I’ll wait for Jeremy’s new AMI and re-ask the question if I run into it then. Sorry for the noise.

No worries at all! I think you are spot on regarding the AMI availability. I think I could copy it to a different region, but I am not sure if there is a region that everyone in the US uses, and also not sure if there would really be any interest in the AMI. Besides, the one from @jeremy is definitely the way to go.

Well, quite unsurprisingly I guess, PyTorch built on p3.2xlarge will only work on p3.2xlarge, and the same goes for p2.xlarge.
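
One possible explanation is that a source build only generates GPU code for the card present at build time. PyTorch’s source build reads the TORCH_CUDA_ARCH_LIST environment variable, so a single build covering both cards might be possible by listing both compute capabilities. The sketch below only composes that value; whether such a build actually works on both instance types is an untested assumption, and the helper name arch_list is hypothetical.

```python
# Compute capabilities of the GPUs in the two instance types from this thread.
CAPABILITIES = {
    "p2.xlarge": "3.7",   # Tesla K80 (Kepler)
    "p3.2xlarge": "7.0",  # Tesla V100 (Volta)
}


def arch_list(instance_types):
    """Build a semicolon-separated value for TORCH_CUDA_ARCH_LIST."""
    unique = sorted({CAPABILITIES[name] for name in instance_types}, key=float)
    return ";".join(unique)


if __name__ == "__main__":
    # Print a line that could be pasted into the shell before building.
    value = arch_list(["p2.xlarge", "p3.2xlarge"])
    print(f'export TORCH_CUDA_ARCH_LIST="{value}"')
```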

I built it again on p2.xlarge and created a public AMI: ami-004aec79. Again, available only in eu-west-1.
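
Since each AMI now matches one instance type and one region, a small lookup can keep a launch script from picking the wrong image. This is a sketch only: the IDs are the ones quoted in this thread, the region of the p3 build is assumed to be eu-west-1 (as the word "again" suggests), and pick_ami is a made-up helper name.

```python
# AMI IDs quoted in this thread, keyed by (region, instance type).
# Assumption: the p3 build (ami-acac0ad5) also lives in eu-west-1.
THREAD_AMIS = {
    ("eu-west-1", "p2.xlarge"): "ami-004aec79",
    ("eu-west-1", "p3.2xlarge"): "ami-acac0ad5",
}


def pick_ami(region, instance_type):
    """Return the AMI for this region and instance type, or raise a clear error."""
    try:
        return THREAD_AMIS[(region, instance_type)]
    except KeyError:
        raise LookupError(
            f"no AMI from this thread for {instance_type} in {region}; "
            "copy one to your region or run the install script on a stock Ubuntu image"
        ) from None


if __name__ == "__main__":
    print(pick_ami("eu-west-1", "p2.xlarge"))
```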

@radek
I couldn’t run the Jupyter notebook on a reserved instance.
The install script ran without any errors.
However, ./start-jupyter-notebook doesn’t do anything.
I also wasn’t required to do steps 3, 4 and 5.

hey @init_27 - did ./start-jupyter-notebook give you an error? also, after downloading the installation script did you run bash install-gpu-part1-v2.sh?

Could you please run history 20 and copy the output?
