Platform: AWS EC2 (DLAMI) ✅

Seems to be a problem that’s not getting fixed with the nightly builds. My workaround was to switch to SageMaker.

1 Like

Issue still persists. I tried on my local computer with the CPU version. It’s running fine as of now (but, as expected, very very slow)… I did not try updating pytorch in case it breaks things here too :stuck_out_tongue_winking_eye:

Awaiting a fix on AWS :thinking:

Hi @jeremy, the EC2 platform seems to be broken for FastAI, since about Nov. 11. I tried it with DLAMI versions 16 and 17, with nightly builds of all the packages. Any time create_cnn is invoked, the kernel immediately dies. It looks like other folks have been having the same issue. Any thoughts on how we can get help to fix this?
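For reference, the failing call is just the start of the lesson1-pets notebook (a sketch using the notebook’s default path and filename regex, so treat those as assumptions rather than anything new):

from fastai.vision import *

path = untar_data(URLs.PETS)
fnames = get_image_files(path/'images')
data = ImageDataBunch.from_name_re(path/'images', fnames, r'/([^/]+)_\d+.jpg$',
                                   ds_tfms=get_transforms(), size=224).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet34, metrics=error_rate)  # kernel dies on this line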

2 Likes

Just wanted to chime in and say that I’m also having this issue. I followed the AWS setup instructions to a tee (the only difference being that I used a previous .pem file, but I can’t imagine that matters) on two separate boxes, and I’m still having the issue. I have the suggested p2.xlarge box, and I run into the error when calling create_cnn, both in my own notebook and in the lesson1-pets notebook. Because the kernel kept restarting, I decided to put all the code into a .py file and run it that way. I was still met with an error; this time it was Illegal instruction (core dumped). I found online that this error happens to some people when they import the TensorFlow package. I know that we are not doing that, but I wonder whether both PyTorch and TensorFlow have some dependency that we are not satisfying.

1 Like

The kernel-dying problem can be resolved by moving to pytorch 1.0.0, but after that I ran into separate issues. As others have suggested, I will try SageMaker.

@jm0077 @andrew771 @sujithvn @jonrosspresta @dhananjay014 I had to set up fastai on an EC2 instance last week, and due to the reported issues with the DL AMI I did the install on top of the AWS Ubuntu 18 AMI. I published the result as the “fastai.v1” AMI, which you can search for when launching your instance. It should work out of the box with conda activate fastai.

5 Likes

I’ve had the same issues with the kernel dying on create_cnn.

After digging into the fastai source code a bit, it seems to occur when executing this line of code in the create_cnn function:

body = create_body(arch(pretrained), ifnone(cut, meta['cut']))

I tried precalculating the two arguments, and it appears to fail on:

arch(pretrained)

The strange thing is that if I replicate these exact steps manually in the notebook, it works fine.
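For anyone who wants to check the same thing, this is roughly what I ran by hand (a sketch; in fastai v1, models.resnet34 is torchvision’s resnet34, so arch(pretrained) reduces to the call below):

from fastai.vision import models

arch = models.resnet34
model = arch(True)   # the arch(pretrained) call that kills the kernel inside create_cnn
model.eval()         # runs without a problem when executed manually in the notebook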

@wdhorton HUGE thanks for putting this together–first time I’ve been up and running on EC2 in over a week.

Thanks for the AMI, but I could not find it in us-west-2. What region is it in?

Thanks for the AMI, but I could not find it in the Oregon region.

Also, I just wanted to know whether this AMI is built from the base image (without any frameworks installed) or from the image that has all the major frameworks pre-installed. I’m checking because the latter takes up a huge chunk of your provisioned EBS volume, which could be a waste if we are not using those frameworks.

Cheers !!

1 Like

@sujithvn @aymenim I just created a copy of the image in us-west-2 (Oregon) so you should be able to find it now.

The image is built on plain Ubuntu 18 with just conda, pytorch, fastai, and the nvidia drivers installed. So it should be pretty slim.

3 Likes

@wdhorton Thanks again… will try it today.

Thanks William, found your image in US-West-2 (Oregon) and back up and running!

I have noticed that it seems slower than the DLAMI. For reference, training the resnet34 model in Lesson 1 with all defaults took 4 min 48 s, and fine-tuning took 1 min 10 s.

I just installed fastai using the following commands on a fresh AWS Ubuntu 16.04 p2.xlarge instance:
conda create -y python=3.6 --name fastai-py3.6
conda activate fastai-py3.6
conda install -y conda
conda install -y pip setuptools
conda install -y -c pytorch pytorch-nightly cuda92
conda install -y -c fastai torchvision-nightly
conda install -y -c fastai fastai
conda uninstall -y fastai
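# editable dev install: assumes the current directory is a clone of the fastai repo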
pip install -e .[dev]

I can see that pytorch can access the GPU, but as reported above the jupyter kernel dies on create_cnn (e.g. in lesson1-pets), and I get an Illegal instruction (core dumped) at create_cnn when running the notebook as a script.

The exact crash occurs in learner.py when m.eval is called on the returned dummy_batch tensor, which for cats and dogs has shape torch.Size([1, 3, 64, 64]), as per below:

def dummy_eval(m:nn.Module, size:tuple=(64,64)):
    "Pass a dummy_batch in evaluation mode in m with size."
    return m.eval()(dummy_batch(m, size))
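
To check whether it’s the pytorch binary itself rather than fastai, one thing worth trying is the equivalent forward pass outside fastai (a sketch using torchvision directly, so the model choice is an assumption):

# If this also dies with "Illegal instruction (core dumped)", the problem is the
# pytorch build (CPU instruction support), not fastai.
import torch
import torchvision.models as tvm

m = tvm.resnet34(pretrained=True)
x = torch.zeros(1, 3, 64, 64)   # same shape as fastai's dummy_batch
with torch.no_grad():
    out = m.eval()(x)
print(out.shape)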

It’s not possible for me to use a shared AMI, so to get around this I’ll try pytorch 1.0.0… Anyone have tips for working with pytorch 1.0.0 and fastai v1?

Adding myself to the list of people suffering from this problem… Aren’t there people from AWS who can help us out here?

1 Like

Using Ubuntu 16 worked! (One extra step: install the nvidia drivers.)
I can update the official documentation with the latest instructions. @jeremy, should I create a pull request to do this?

That’s great! In the meantime, could you post your install commands here?

# Update
sudo -i
apt-get update && apt-get --assume-yes upgrade

# Install Lib
sudo apt-get --assume-yes install build-essential gcc g++ make \
    binutils htop screen \
    software-properties-common unzip tree awscli cmake

# CUDA 10
wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64
sudo dpkg -i cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64
sudo apt-key add /var/cuda-repo-10-0-local-10.0.130-410.48/7fa2af80.pub
sudo apt-get update
sudo apt-get --assume-yes install cuda

# CUDA Check
nvidia-smi

# Anaconda
wget https://repo.anaconda.com/archive/Anaconda3-5.3.0-Linux-x86_64.sh
bash Anaconda3-5.3.0-Linux-x86_64.sh -b

# Environment Variables
export CONDA_HOME=/home/ubuntu/anaconda3
export PATH=$CONDA_HOME/bin:$PATH

conda update conda
conda upgrade --all --yes
conda install -c pytorch -c fastai fastai
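
To verify the install can see the GPU afterwards, a quick check from Python (a sketch using only standard pytorch calls):

import torch

print(torch.__version__)                  # version pulled in by the fastai conda package
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the Tesla K80 on a p2.xlarge
else:
    print('CUDA not available -- check the nvidia driver / CUDA 10 install')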

4 Likes

Note: use the good old Ubuntu 16 AMI.

1 Like

Sure! Thanks for the offer :slight_smile: I’m not sure what happened with the AWS DLAMI - we had this same problem. Our solution was to use the plain Ubuntu 18 AMI. I’d suggest that over Ubuntu 16.

1 Like