If you are looking for an AMI and not an installation script, you should be using the official one provided by @jeremy and not the one I share in this thread.
IMPORTANT: It turns out PyTorch doesn't work well with CUDA 9. Maybe it needs to be compiled differently, maybe there is something else that I am missing. It works okay with CUDA 8, so I updated the script to use CUDA 8. This means we don't have p3 support, but for using PyTorch this is the way to go.
In the new installation script I no longer download the dogsvscats dataset nor do I pull a version of fastai from my forked repo. You should be doing
git clone https://github.com/fastai/fastai.git instead.
EDIT: I share one way of spinning up an instance and sshing into it in this post. These instructions are equivalent to step #1 and step #2 mentioned below.
Install script with blank Ubuntu Server 16.04 LTS (ami-785db401) instance
- Spin up a p2.xlarge or p3.2xlarge
- SSH into your instance and run:
- Press enter / y when prompted.
- Enter a password for the jupyter notebook when prompted and hit enter.
- To run jupyter notebook, execute
- To connect to your jupyter notebook, either configure a public IP for your instance or follow the instructions provided here for tunneling [recommended].
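The connect-and-run flow above can be sketched roughly as follows. The key path and hostname are hypothetical placeholders (substitute your own), and port 8888 with an SSH tunnel is just one common way to do the recommended tunneling, not necessarily exactly what the script configures:

```shell
# Hypothetical values - substitute your own key file and instance address.
KEY=~/.ssh/my-aws-key.pem
HOST=ubuntu@ec2-203-0-113-10.eu-west-1.compute.amazonaws.com

# SSH into the instance (Ubuntu AMIs use the "ubuntu" user).
ssh -i "$KEY" "$HOST"

# On the instance: start the notebook server without opening a browser.
jupyter notebook --no-browser --port 8888

# From your local machine: tunnel port 8888 [recommended],
# then point your browser at http://localhost:8888
ssh -i "$KEY" -N -L 8888:localhost:8888 "$HOST"
```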
Install script with AWS Spot instances with persistent storage
- this gives you p2 instances for ~$0.25 per hour -
- Follow the instructions here up to step #6.
- In step #6, replace the command with the following:
- uncomment lines 69 - 71
- Perform the remaining steps.
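The persistent-storage dance that my bash scripts automate boils down to something like the following aws CLI calls. The volume/instance IDs, device name, and spec file name are placeholders I made up for illustration, not values from the actual scripts:

```shell
# Request a one-time spot instance from a launch specification file
# (instance type, AMI, key pair etc. live in spot-spec.json).
aws ec2 request-spot-instances \
    --instance-count 1 \
    --type one-time \
    --launch-specification file://spot-spec.json

# Once it is running, attach the persistent EBS volume to it.
aws ec2 attach-volume \
    --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 \
    --device /dev/sdf

# At teardown, detach the volume so your data survives the instance.
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
```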
AWS AMI with the above installed
AMI for p2.xlarge: ami-19a00360 (new AMI with CUDA8)
Accessible only from region: eu-west-1
(BTW, we’ll be providing an AMI to make this easier in the next few days, so don’t worry if this all looks a little overwhelming! And we won’t be using AWS in the course until lesson 2 at the earliest).
Hey @radek … what is your experience using the spot instances?
The price is unbeatable so I’m wondering what the cons are (and how seriously they are cons) when compared to the standard p2 instances.
Hardware-wise they are the same instance, but with spot instances if you stop them, they go poof. One way to work around this is attaching a volume when you spin up an instance and detaching it at tear down. I wrote some bash scripts that make this super easy for me and share them above in the “Install script with AWS Spot instances with persistent storage” section.
If you don't care about persisting your data (for example, if you only want to work on the notebooks provided for the course), or are happy with uploading your work to github when you are done (that only works easily for small files though - you can't save weights etc.), then you can spin up a spot instance using the provided AMI.
In very brief summary:
- spot instances -> cheaper but extra work
- on demand instances -> more expensive but less work
Nice writeup! I have a setup for bootstrapping + teardown of EC2 instances (with VPC and all) using Terraform for my personal usage. Inspired by this, after @jeremy provides an AMI, I'll try to optimize a zero-click Terraform setup if possible and share it.
Looking forward to using Crestle for the session later today.
Just adding this information here for those not already familiar with AWS billing gotchas. One thing to keep in mind about the spot instance setup is that persisting data is not free. Once provisioned, users get billed for existing EBS volumes regardless of usage (attached/used or not), with prices proportional to the size of the volume (https://aws.amazon.com/ebs/pricing/). Most importantly, one must remember to delete/release the EBS volume later when it is no longer needed (after the course etc.) cc @wgpubs
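To make the "proportional to size" point concrete, here is a back-of-the-envelope calculation. The ~$0.10 per GB-month gp2 price is my assumption and varies by region, so check the pricing page above, and the volume ID in the cleanup command is a made-up placeholder:

```shell
# Rough monthly cost of keeping an EBS gp2 volume around, attached or not.
SIZE_GB=128
PRICE_CENTS_PER_GB=10    # assumed ~$0.10 per GB-month, expressed in cents
MONTHLY_CENTS=$((SIZE_GB * PRICE_CENTS_PER_GB))
printf '~$%d.%02d per month\n' $((MONTHLY_CENTS / 100)) $((MONTHLY_CENTS % 100))

# When you no longer need the data (e.g. after the course), release the volume:
# aws ec2 delete-volume --volume-id vol-0123456789abcdef0
```

So a 128 GB volume costs roughly $13 a month even while fully idle - small, but easy to forget about.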
Good stuff @suvash and @radek … thanks for the info!
I am in the same AWS region so I was able to launch your AMI, ssh to it and do the installation following your instructions. I was not able to connect to the jupyter server though. I logged in again to the instance to check, and in the Jupyter server screen (still running) I saw a stack of error messages like this:
SSL Error on 10 ('127.0.0.1', 39422): [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:777)
I don't have time for debugging now (pending a 2.30 am wake-up call) but I will give it another try tomorrow. FYI, during the installation I also noticed some error messages due to already existing dirs (anaconda3, fastai and dogscats).
Great work on your part anyway. Thanks!
P.S. I see you install CUDA 9, is PyTorch working with that CUDA version? (on the PyTorch site it’s still CUDA 8)
Hi @radek -
Awesome writeup. The PyTorch docs at pytorch.org only have binaries for CUDA 7.5 and 8, so it might be easier to install with those than to compile from scratch for CUDA 9.
Also, your step of installing PyTorch might not be CUDA enabled:
conda install pytorch torchvision -c soumith is what's in your script.
The docs say to use
conda install pytorch torchvision cuda80 -c soumith.
Just wanted to point out the differences. If you are already aware and it works anyway, then that’s great. Thanks for sharing these steps.
Using the AMI, you do not need to install anything - everything should already be preinstalled. Ideally you should just be able to boot up the instance using the AMI and be ready to go.
Good point! Frankly speaking, I do not know - it seems to work. You are probably right though - I should have gone with CUDA 8, as that is what's in the docs.
The good news is that if we run into any problems with CUDA 9, it should be easy to replace - I somewhat understand now what is going on with the install, or at least what components are necessary to get this up and running and where to get them. It should be fairly easy to swap out one of the pieces, and a cursory google search inspires me with optimism that this might work ok.
I misinterpreted your “Install script with your AWS instance” instructions as something to be done after launching my instance with your AMI. If the AMI is ready to run as-is, much better!
Sorry I wasn't clear! If you encounter anything else that seems off or come across any other issues, please give me a shout and we will try to figure it out.
kudos to @radek !
I am testing your script on p3.2xlarge and it looks like torch uses only CPUs.
Did it use GPU on p2.xlarge?
Yes, it did, on p2.xlarge. I am not sure what the issue is - keras seems to work okay with the tensorflow backend on p3.2xlarge.
I wonder what could be happening - I know very little about torch, so I most likely won't be able to troubleshoot this. I tried running
torch.cuda.is_available() on p3.2xlarge and it returned False.
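A quick sanity check that helps narrow this kind of thing down (these are standard commands, not anything specific to the install script): nvidia-smi confirms the driver sees the GPU at all, and the python one-liner confirms whether the installed PyTorch build can use it.

```shell
# Does the NVIDIA driver see the card?
nvidia-smi

# Does the installed PyTorch build have working CUDA support?
python -c "import torch; print(torch.cuda.is_available())"
```

If nvidia-smi works but is_available() prints False, the problem is in the PyTorch build rather than the driver.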
I guess I was hoping for a discussion like the one we are having right now, as I am very new to all this as well. Thanks for reporting the problem!
Maybe this can be traced to using CUDA 9 instead of CUDA 8. I will replace CUDA 9 with CUDA 8 at the latest tomorrow and will see if it changes anything for the better.
I believe CUDA 9 is required for V100/p3
@radek sure, I am curious to make it work on p3. I don't think CUDA 8 will work with the V100 (for me, even nvidia-smi could not recognize the GPU with CUDA 8 on p3).
Will keep troubleshooting…
Doing a bit more reading, and I think you are right. The Tesla K80 is Kepler architecture (not even Pascal), whereas the V100 is Volta, which IIUC requires CUDA 9.
I think the issue is the way I install pytorch - I am guessing that we need to build it from source and not install via conda but not sure, reading on this further.
According to the PyTorch discussion forums, you can't even build from source for CUDA 9 yet. So you may have to wait a bit here…
I fixed the script and it now works with both p2 and p3 instances. I had to compile pytorch and torchvision.
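For reference, a from-source build roughly follows the generic instructions in the two repos' READMEs. This is a sketch, not the exact commands from my script, and it assumes the CUDA toolkit and the usual conda build dependencies are already in place:

```shell
# Build PyTorch against the locally installed CUDA toolkit.
git clone --recursive https://github.com/fastai/fastai.git 2>/dev/null || true
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
python setup.py install
cd ..

# Build torchvision against the pytorch we just installed.
git clone https://github.com/pytorch/vision
cd vision
python setup.py install
```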
Ran the entire lesson1 notebook with no problems on p3.2xlarge
Changes to the install script pushed. Will create the AMI and update the original post as soon as it finishes building.