Platform: AWS EC2 (DLAMI) ✅

The kernel-dying problem can be resolved by moving to PyTorch 1.0.0, but after that I ran into separate issues. As some have suggested, I will try SageMaker.

@jm0077 @andrew771 @sujithvn @jonrosspresta @dhananjay014 I had to set up fastai on an EC2 instance last week, and due to the reported issues with the DL AMI I did an install on top of the AWS Ubuntu 18 AMI. I published the result as the “fastai.v1” AMI, which you can search for when launching your instance. It should work out of the box with conda activate fastai.

5 Likes

I’ve had the same issues with the kernel dying on create_cnn.

After digging into the fastai source code a bit, it seems to occur when executing this line of code in the create_cnn function:

body = create_body(arch(pretrained), ifnone(cut, meta['cut']))

I tried precalculating the two arguments, and it appears to fail on:

arch(pretrained)

The strange thing is that if I replicate these exact steps manually in the notebook, it works fine.

@wdhorton HUGE thanks for putting this together. First time I’ve been up and running on EC2 in over a week.

Thanks for the AMI, but I could not find it in us-west-2. What region is it in?

Thanks for the AMI, but I could not find it in the Oregon region.

Also, I just wanted to know whether this AMI is built from the base image (without any frameworks installed) or from the image that has all the major frameworks pre-installed. I ask because the latter takes up a huge chunk of your provisioned EBS, which could be a waste if you are not using them.

Cheers !!

1 Like

@sujithvn @aymenim I just created a copy of the image in us-west-2 (Oregon) so you should be able to find it now.

The image is built on plain Ubuntu 18 with just conda, pytorch, fastai, and nvidia installed. So it should be pretty slim.

3 Likes

@wdhorton Thanks again… will try it today.

Thanks William, found your image in US-West-2 (Oregon) and back up and running!

I have noticed that it seems slower than the DLAMI. For reference, training the resnet34 model in Lesson 1 with all defaults took 4 min 48 s, and fine-tuning took 1 min 10 s.

I just installed fastai using the following on a fresh AWS Ubuntu 16.04 p2.xlarge instance:
conda create -y python=3.6 --name fastai-py3.6
conda activate fastai-py3.6
conda install -y conda
conda install -y pip setuptools
conda install -y -c pytorch pytorch-nightly cuda92
conda install -y -c fastai torchvision-nightly
conda install -y -c fastai fastai
conda uninstall -y fastai
pip install -e .[dev]  # run from a clone of the fastai repo (developer install)
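A quick way to confirm that PyTorch in this environment can see the GPU, using the standard PyTorch calls (the device name shown is just an example for a p2.xlarge):

```python
import torch

print(torch.cuda.is_available())  # True on a working GPU instance
if torch.cuda.is_available():
    # e.g. 'Tesla K80' on a p2.xlarge
    print(torch.cuda.get_device_name(0))
```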

I can see that PyTorch can access the GPU, but as reported above the Jupyter kernel dies on create_cnn (e.g. on the lesson1 pets notebook), and I also get an Illegal instruction (core dumped) at create_cnn when running the notebook as a script.

The exact crash occurs in learner.py when calling m.eval on the returned dummy_batch tensor, which for cats and dogs has shape torch.Size([1, 3, 64, 64]), as per below:

def dummy_eval(m:nn.Module, size:tuple=(64,64)):
    "Pass a dummy_batch in evaluation mode in m with size."
    return m.eval()(dummy_batch(m, size))
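For anyone who wants to poke at this code path outside fastai, here is a standalone approximation in plain PyTorch. The dummy_batch reimplementation is an assumption based only on the reported tensor shape, and the tiny conv net is just a stand-in for the real resnet body:

```python
import torch
import torch.nn as nn

def dummy_batch(m: nn.Module, size: tuple = (64, 64)) -> torch.Tensor:
    # One 3-channel image of random values, matching the reported
    # torch.Size([1, 3, 64, 64]) input
    return torch.randn(1, 3, *size).requires_grad_(False)

def dummy_eval(m: nn.Module, size: tuple = (64, 64)):
    "Pass a dummy batch in evaluation mode through m with the given size."
    return m.eval()(dummy_batch(m, size))

# Stand-in model; the crash was observed with a full resnet body
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1), nn.Flatten())
print(dummy_eval(net).shape)  # torch.Size([1, 8])
```

If this standalone version also dumps core, the problem is in the PyTorch build (e.g. CPU instruction support) rather than in fastai itself.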

It’s not possible for me to use a shared AMI; to get around this I’ll try PyTorch 1.0.0… Anyone have tips for working with PyTorch 1.0.0 and fastai v1?

Adding myself to the list of people suffering from this problem… Isn’t there anyone from AWS who can help us out here?

1 Like

Using Ubuntu 16 worked! (One extra step: install the NVIDIA drivers.)
I can update the official documentation with the latest instructions. @jeremy, should I create a pull request to do this?

That’s great! In the meantime, could you post your install commands here?

# Update
sudo -i
apt-get update && apt-get --assume-yes upgrade

# Install Lib
sudo apt-get --assume-yes install build-essential gcc g++ make \
    binutils htop screen \
    software-properties-common unzip tree awscli cmake

# CUDA 10
wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64
sudo dpkg -i cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64
sudo apt-key add /var/cuda-repo-10-0-local-10.0.130-410.48/7fa2af80.pub
sudo apt-get update
sudo apt-get --assume-yes install cuda

# CUDA Check
nvidia-smi

# Anaconda
wget https://repo.anaconda.com/archive/Anaconda3-5.3.0-Linux-x86_64.sh
bash Anaconda3-5.3.0-Linux-x86_64.sh -b

# Environment Variables
export CONDA_HOME=/home/ubuntu/anaconda3
export PATH=$CONDA_HOME/bin:$PATH

conda update conda
conda upgrade --all --yes
conda install -c pytorch -c fastai fastai
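After the script finishes, a quick sanity check of the install using standard PyTorch attributes (the exact version string will depend on what conda resolved):

```python
import torch

print(torch.__version__)          # whichever version conda installed
print(torch.cuda.is_available())  # should be True once the CUDA driver is loaded
```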

4 Likes

Note: use the good old Ubuntu 16 AMI.

1 Like

Sure! Thanks for the offer :slight_smile: I’m not sure what happened with the AWS DLAMI; we had this same problem. Our solution was to use the plain Ubuntu 18 AMI. I’d suggest that over Ubuntu 16.

1 Like

Do we need to use the extra commands provided to install nvidia drivers if we use Ubuntu 18?

Yes, I think all the steps are exactly the same.

@jeremy @astronomy88
Updated the documentation for the Ubuntu 18 AMI.

Cheers

1 Like

Thanks for this… I had spent hours trying to work out what I did wrong until I found this thread.

After following your instructions I am up and running again!

Thanks

Tony