Setup problems: AWS

I’ve also somehow ended up with a g2 instance. It may be because I didn’t enter the code “fast.ai MOOC” in my initial request.

For those who run into the same issue: for whatever reason the default region in the credentials file didn’t match the default region in the config file. Removing the region reference in the former fixed it.
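For anyone who wants to check for this mismatch before editing files by hand, here is a minimal sketch. It assumes the usual INI layout of the two AWS CLI files, with `[default]` or `[name]` sections in the credentials file and `[default]` or `[profile name]` sections in the config file. The function name `find_region_conflicts` is made up for this example and is not part of any AWS tool.

```python
import configparser


def _regions(text):
    """Map profile name -> region for every section that sets a region."""
    # interpolation=None so '%' characters in secret keys are not interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    regions = {}
    for section in parser.sections():
        # The config file names non-default profiles "profile NAME".
        if section.startswith("profile "):
            profile = section[len("profile "):]
        else:
            profile = section
        if parser.has_option(section, "region"):
            regions[profile] = parser.get(section, "region").strip()
    return regions


def find_region_conflicts(credentials_text, config_text):
    """Return (profile, credentials_region, config_region) for each mismatch."""
    cred = _regions(credentials_text)
    conf = _regions(config_text)
    return [
        (profile, cred[profile], conf[profile])
        for profile in sorted(cred)
        if profile in conf and cred[profile] != conf[profile]
    ]
```

Pass it the text of `~/.aws/credentials` and `~/.aws/config`. An empty list means no profile sets two different regions.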

NVIDIA GPU Box:

After a $150 bill with AWS (I forgot to aws-stop), I decided it was time to build my own GPU/CUDA box. Couple of questions:

  1. By mistake, I purchased three NVIDIA Titan XP cards instead of the one I had intended. I have only installed one in my GPU Box. Should I return two of these, or can I use them in the same box seamlessly? I’ve tried to research this question online and I am now as “clear as mud”. My only application for CUDA is deep learning. If I can use an extra card without changing my code, I may keep one. If I have to code to the device, I will return these cards. (Does the answer change if I were to switch from Theano to TensorFlow?)

  2. On my GPU Box, I used Ubuntu 17.04. I’m having some issues. Is it worth trying to make things work, or should I just retreat to 16.04? (I haven’t spent enough time on trying to make it work, so I will report back on my progress.)

I have been able to activate an AWS p2 GPU instance and work through the first homework. Very cool! I have three teenage kids and they are blown away that a computer can predict dogs vs. cats with such high accuracy!

Thanks,

Bryan

TensorFlow supports using multiple GPUs. If you don’t have budget issues and you can fit them in your current hardware, you should keep them. Good problem to have :slight_smile:
https://www.tensorflow.org/tutorials/using_gpu#using_multiple_gpus

Local GPU Box:

After 4-5 days of experimentation, I was finally able to install a GPU Box locally.
IMPORTANT: Use Ubuntu 16.04. DON’T try any other version. (I tried with 17.04 and therein was madness. . . )

I am using a single NVIDIA Titan XP card. (I will try with a second card and report back.)

In terms of benchmarking, I found the Titan XP card to be twice as fast as the AWS p2 instance when training on “dogs-vs-cats-redux-kernels-edition”.

For others wanting to try, here is my script to set up a local Ubuntu 16.04 box. This is basically the ‘install-gpu.sh’ script with some personal customizations:

git clone https://github.com/prairie-guy/gpu_setup.git

Hi @prairieguy, I am just curious how much wall time one epoch of training takes for the dogs and cats problem in lesson 2 on this GPU card. I have a 780M, which has a compute capability of 3, and each epoch takes 30 minutes. I am trying to find out how I can make it faster.

For those having problems with broken pipes or connection timeouts on OS X: I had to disable the QoS option in SSH, either with -o IPQoS=0 on the command line or in the SSH config file.

I identified this when I ran SSH with -vvv and noticed that the session was getting stuck at “send packet: type 1” in the console window.

This is documented here -

jagatsingh - For the Dogs vs. Cats Redux problem (fine tuning of Vgg16.py), my “Wall time” with a Titan XP card was 4 minutes 15 seconds.

prairieguy

wget works for me if you use the URL for the raw file. Don’t use other URLs for setup_p2.sh: wget will download HTML tags along with the file contents, which is why you got the ‘newline’ issue there.
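A quick way to tell whether a downloaded setup script is really a shell script is to look at its first bytes. The sketch below is only a heuristic and the helper name `diagnose_script` is made up for this example. The check for Windows (CRLF) line endings is my own addition, on the assumption that they are another common source of newline errors under Cygwin. The post above only mentions HTML.

```python
def diagnose_script(data: bytes) -> list:
    """Return a list of likely problems with a downloaded shell script."""
    problems = []
    # An HTML page normally starts with a doctype or an <html> tag.
    head = data[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        problems.append("looks like an HTML page, not a shell script")
    # Assumption: CRLF line endings can also trigger newline errors in bash.
    if b"\r\n" in data:
        problems.append("contains Windows (CRLF) line endings")
    return problems
```

Read the file in binary mode (for example `open("setup_p2.sh", "rb").read()`) and pass the bytes in. An empty list means neither problem was detected.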

Hey! Could you explain exactly how you did it? Do I need to terminate the instance and then try again after the error?
Thanks

aws-start shortcut problem

After setting up everything correctly, I wanted to find the URL for my instance, so I (perhaps stupidly in hindsight) typed ‘bash setup_p2.sh’ again to look at the output. It created a new p2 instance and now I can’t use the ‘aws-start’ shortcut any more. Instead, I get the following:

```
An error occurred (InvalidParameterCombination) when calling the StartInstances operation: No instances specified
```
How can I fix this?
Help much appreciated :smile:

I’ve been stuck for two weeks now on running the script to set up the p2 instance. I followed the instructions to restart with Amazon (all good), uninstalled and reinstalled Python (making sure I was using 2.7), reran the aws configure steps, and made sure that my user had the correct permissions in Amazon, but I’m still getting the same error messages as before.

The setup script does seem to be doing something (it appears to have created a new security group in AWS and a new elastic IP as well, I think), but I really don’t know what might be causing these “Malformed” errors. Any clues out there?

Hi @andrewtuplin,

Which operating system and shell are you using, and how did you get the setup files?

Windows 10. Cygwin shell. I used wget to download the setup file from http://files.fast.ai/files/setup_p2.sh

Can you try getting the files from the GitHub repository?

wget https://raw.githubusercontent.com/fastai/courses/master/setup/setup_p2.sh
wget https://raw.githubusercontent.com/fastai/courses/master/setup/setup_instance.sh

Did you solve the problem? I ran into the same problem, and even after I set the region to us-west-2, it still did not work.

I chose Oregon as the default region in the AWS console, and I also used “aws configure --profile” to set the region in aws-cli to us-west-2. Every time I ran bash setup_t2.sh, I got the following error:
An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id ‘[ami-f8fd5998]’ does not exist
Then I checked AWS and found that a new VPS had been created in the Virginia region, not Oregon.
Can anybody help me figure this out? Thanks a lot!

EDIT: fixed by changing “Values” to t2.xlarge from t2.large in the aws-get-p2 command in aws-alias.sh

Hey all,

I ran the t2 setup script and then the p2 setup script, hoping to have both instances to switch back and forth between for development.

However, now when I run “aws-get-t2” it returns “None”. I can get p2 just fine. I assume this is due to the order I ran the script in – and I can ssh into the t2 instance just fine.

Elizabeths-MacBook-Pro-2:~ elizabeth$ aws-get-t2
None

Any tips on why it might not be returning the t2 instance id?

Thanks!
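One possible explanation, going by the EDIT above, is that the aws-get-* aliases pick an instance by filtering on instance type, so a filter value that does not match the actual type (t2.large vs. t2.xlarge) finds nothing. The AWS CLI prints "None" in text output when a query result is empty. The sketch below mimics that selection in plain Python on the JSON returned by `aws ec2 describe-instances --output json`. The function name `find_instance_id`, the optional state filter, and the instance IDs are all made up for illustration, and whether the real alias also filters on state is an assumption.

```python
def find_instance_id(response, instance_type, state=None):
    """Return the first matching InstanceId in a describe-instances
    response, or None when nothing matches the filters."""
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance.get("InstanceType") != instance_type:
                continue
            # Optional filter on state name, e.g. "stopped" or "running".
            if state is not None and instance.get("State", {}).get("Name") != state:
                continue
            return instance.get("InstanceId")
    return None


# Hypothetical response with one p2 and one t2 instance (IDs are invented).
sample = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-0aaa111", "InstanceType": "p2.xlarge",
                        "State": {"Name": "stopped"}}]},
        {"Instances": [{"InstanceId": "i-0bbb222", "InstanceType": "t2.xlarge",
                        "State": {"Name": "running"}}]},
    ]
}
```

With this sample, searching for "t2.large" finds nothing, while "t2.xlarge" finds the t2 instance, which mirrors the fix reported in the EDIT.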

When you use the aws-get-p2, is it still returning an instance id?

Hi Vijay,

I am also stuck at the same step. I am unable to start awscli, and it won’t uninstall either; it gets stuck. Can you please share an update on how you resolved your case? That would be very helpful.

Thanks,
Vivek.S