Using AWS P3 instances

Hi @fizx,

thanks for the tip. I just submitted an AWS Spot instance request asking for access to p3.2xlarge, p3.8xlarge and p3.16xlarge in the US West (Oregon) region, using the Fastai AMI.

Let’s wait and see.

More information about AWS Spot instances: https://aws.amazon.com/ec2/spot/?nc1=h_ls

[EDIT - 1st of June 2018]

@jeremy: I forgot to update this post to say that AWS finally accepted my p3 request on April 24th. Thanks :slight_smile:


Hi @jeremy,

thanks for your help, as I do not understand what AWS has in mind.
My first request to use an AWS p3 instance was on 04/15/2018. Let’s see how long it takes to get a final YES or NO.

Another point: in my last request to AWS, I sent the list of AWS regions where the Fastai AMI from @reshama is available.
Is this list up to date? What about putting a Fastai AMI in all AWS regions?

Hello @Moody,

I guess that since November 2017 (today is the 31st of May 2018), the https://pytorch.org/ site has updated its packages, so we no longer need to install pytorch from source; we can use conda instead.

In order to use the fastai part1v2 AMI on an AWS p3 instance, just run the following commands in your Ubuntu ssh session on AWS:

conda remove pytorch
conda install pytorch torchvision cuda90 -c pytorch

I just tried it on an Amazon Web Services (AWS) p3.2xlarge and it worked :slight_smile:
And I tested it on lesson1.ipynb: p3.2xlarge runs about 4x faster than p2.xlarge :slight_smile:

Note: I also tried installing cuda91 (conda install pytorch torchvision cuda91 -c pytorch), but it did not work (the lesson1 notebook did not detect CUDA: torch.cuda.is_available() returned False).
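By the way, to double-check the result of the reinstall, here is a small Python sketch (run it inside the fastai conda env; the exact version string depends on your PyTorch build) that shows which CUDA version PyTorch was compiled against and whether the V100 is visible:

import torch

# CUDA version this PyTorch binary was compiled against
# (after installing the cuda90 package I expect something starting with 9.0)
print(torch.version.cuda)

# True only if the binary and the driver agree and a GPU is usable
print(torch.cuda.is_available())

# on a p3.2xlarge this should report the Tesla V100
print(torch.cuda.get_device_name(0))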


Thanks for the update. Per the GitHub log, Jeremy updated the environment.yml to cuda90 on 21 Dec 2017. The fastai AMI should have been ready to go since then. I don’t think you need to run extra conda commands.

Well, on my side I had to. So I understand what you say, @Moody, but I do not understand why I had to follow a different path.

This is the path I followed:

  1. After I launched my AWS p3.2xlarge instance, I did conda update --all, cd fastai, git pull, conda env update.
  2. Then, I ran the lesson1 notebook and got the following message after 8 minutes (note: my lesson1 notebook did run, but slowly, since my PyTorch had been compiled with CUDA 8, as shown in the message below):
/home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/cuda/__init__.py:95: UserWarning: 
    Found GPU0 Tesla V100-SXM2-16GB which requires CUDA_VERSION >= 9000 for
     optimal performance and fast startup time, but your PyTorch was compiled
     with CUDA_VERSION 8000. Please install the correct PyTorch binary
     using instructions from http://pytorch.org
    
  warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))

Finally, I ran the following 2 lines in my ssh terminal to AWS, and then my lesson1 notebook did work correctly with CUDA 9:

conda remove pytorch
conda install pytorch torchvision cuda90 -c pytorch

Note: as above, I also tried installing cuda91 (conda install pytorch torchvision cuda91 -c pytorch), but it did not work (the lesson1 notebook did not detect CUDA: torch.cuda.is_available() returned False).

Yeah I think somehow they get tied to the old channel.


Hi guys, I’ve noticed that P3 instances aren’t being used at full capacity when training on ImageNet, both with the main PyTorch ImageNet example and with some of my own code. Although I do see the 4x speedup over P2 instances that others have noted, I believe there is about another 2x left (for a total of 8x), since running constant tensors through the GPUs goes about twice as fast. I’ve spent a long time trying to profile what’s going on, and I believe it’s an I/O issue with Amazon’s EBS volumes/instances, but I can’t figure out exactly what is bottlenecking. The problem isn’t that big a deal on P2s, since the GPU is the bottleneck there, but it becomes apparent with faster GPUs. Does anyone have ideas?
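In case it helps, here is a rough timing sketch I would use (the resnet50/ImageFolder setup, the data path and the batch size are my own assumptions, not your exact code) that separates the time spent waiting on the DataLoader from the time spent in the GPU forward/backward pass. If the data-wait share dominates, the EBS/CPU input pipeline is the limiter rather than the V100:

import time
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# hypothetical path and settings -- adjust to your own ImageNet copy on the EBS volume
train_dir = '/home/ubuntu/data/imagenet/train'
tfms = T.Compose([T.RandomResizedCrop(224), T.ToTensor()])
loader = DataLoader(ImageFolder(train_dir, tfms), batch_size=256,
                    shuffle=True, num_workers=8, pin_memory=True)

model = models.resnet50().cuda()
criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data_time, gpu_time = 0.0, 0.0
end = time.time()
for i, (images, target) in enumerate(loader):
    t0 = time.time()
    data_time += t0 - end                     # time spent waiting for the input pipeline
    images = images.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    loss = criterion(model(images), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                  # wait for the GPU so the timing is honest
    gpu_time += time.time() - t0
    end = time.time()
    if i == 50:
        break

print('data wait: %.1fs, gpu compute: %.1fs' % (data_time, gpu_time))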

Sorry for the repeat post guys, but after looking into it a bit more, it seems to be half related to read constraints and half limited by the CPU power of the instance. If we crop to 1 pixel instead of 224, the training process speeds up dramatically (50+%). I think the P3 instance might not be proportionally equipped, CPU-wise, to handle on-the-fly pre-processing of higher-resolution images.
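To test the CPU-preprocessing hypothesis directly, one sketch (same assumed ImageFolder layout and path as above; the worker counts and crop sizes are just guesses) is to iterate over the DataLoader without touching the GPU at all and measure how many images per second the instance can read, decode and augment for different crop sizes and num_workers values:

import time
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

train_dir = '/home/ubuntu/data/imagenet/train'   # hypothetical path

def images_per_sec(crop, workers, n_batches=50):
    # build an input pipeline only: decode + augment, no GPU work at all
    tfms = T.Compose([T.RandomResizedCrop(crop), T.ToTensor()])
    loader = DataLoader(ImageFolder(train_dir, tfms), batch_size=256,
                        shuffle=True, num_workers=workers)
    start, seen = time.time(), 0
    for i, (images, _) in enumerate(loader):
        seen += images.size(0)
        if i == n_batches:
            break
    return seen / (time.time() - start)

for workers in (4, 8, 16):
    for crop in (224, 64):
        print(workers, crop, '%.0f img/s' % images_per_sec(crop, workers))

If the img/s figure at crop 224 stays well below what the V100 consumes in a training loop, the input pipeline (CPU decode/augment plus EBS reads) is indeed the bottleneck.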

Hi,
I was using p3 instances, and looking at the amount of RAM available, I can fit the whole dataset into RAM. Will loading the whole dataset into RAM be faster than loading the data in batches from the SSD, thus saving I/O operations?
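If reading and decoding from EBS are the limiter, caching in RAM can help, but keep in mind that decoded images take a lot more memory than the JPEGs on disk. Here is a minimal sketch of an in-memory cache wrapper (my own illustration, not a fastai or PyTorch built-in): it reads each sample from disk once, keeps the decoded image in RAM, and still applies the augmentation transform on every epoch:

import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import Dataset, DataLoader

class InMemoryDataset(Dataset):
    # caches decoded samples in RAM on first access; transforms still run per epoch
    def __init__(self, base_dataset, transform=None):
        self.base = base_dataset
        self.transform = transform
        self.cache = {}

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        if idx not in self.cache:
            self.cache[idx] = self.base[idx]   # first epoch reads from the SSD/EBS volume
        img, label = self.cache[idx]
        if self.transform is not None:
            img = self.transform(img)
        return img, label

# hypothetical usage: no transform on the base dataset, augmentation on the cached images
base = ImageFolder('/home/ubuntu/data/train')
ds = InMemoryDataset(base, transform=T.Compose([T.RandomResizedCrop(224), T.ToTensor()]))
loader = DataLoader(ds, batch_size=256, shuffle=True, num_workers=0)

One caveat: with num_workers > 0 each DataLoader worker process keeps its own copy of the cache, so this naive version only really pays off with num_workers=0 (or with a cache kept in shared memory).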