Platform: SageMaker ✅

I'm having this issue as well.

I had this issue too, and I think I resolved it by running these steps from the “Start notebook” lifecycle configuration script directly in the terminal available from the notebook instance (the very bottom option in the New dropdown). The version I have has the PyTorch updates commented out, which is likely the source of the error:

source /home/ec2-user/anaconda3/bin/activate pytorch_p36
python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user
conda install -y pytorch torchvision -c pytorch
conda install -y fastai -c fastai
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
pkill -f jupyter-notebook

Oh my goodness, thank you so much. This fixed the issue!

Hello, I'm trying to deploy a fastai model on SageMaker and I'm seeing this error in CloudWatch:
sagemaker_containers._errors.InstallModuleError: InstallModuleError

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/gunicorn/workers/base_async.py", line 56, in handle
    self.handle_request(listener_name, req, client, addr)
  File "/usr/local/lib/python3.6/dist-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
    addr)
  File "/usr/local/lib/python3.6/dist-packages/gunicorn/workers/base_async.py", line 107, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_pytorch_container/serving.py", line 103, in main
    user_module = modules.import_module(serving_env.module_dir, serving_env.module_name)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_modules.py", line 244, in import_module
    install(_env.code_dir)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_modules.py", line 110, in install
    _process.check_error(shlex.split(cmd), _errors.InstallModuleError, cwd=path, capture_error=capture_error)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_process.py", line 48, in check_error
    raise error_class(return_code=return_code, cmd=' '.join(cmd), output=stderr)

I have a requirements.txt that installs fastai and librosa:

fastai
librosa
SoundFile
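
(For anyone else hitting this: per the traceback, InstallModuleError is raised when the pip install of your source directory — which pulls in requirements.txt — fails inside the serving container, and the underlying pip error is usually a few lines earlier in the CloudWatch log. One thing worth trying is pinning versions so pip doesn't resolve packages that are incompatible with the container's Python 3.6. The pins below are purely illustrative, not a known-good set — match them to what your container actually supports:)

```text
# Example of a pinned requirements.txt (illustrative versions only)
fastai==1.0.61
librosa==0.7.2
SoundFile==0.10.3.post1
```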

Hi - can I ask how you folks are running training overnight in SageMaker notebooks? I automatically get logged out, and that suspends the training. According to this comment from Amazon that's the expected behavior, but I feel like I've seen Jeremy and others show the results of overnight training runs.

I could follow the instructions in that comment and execute a python script with nohup (I’m already using SaveModelCallback), but I’d prefer to see the output of each epoch like you can in a notebook.

Thanks

For future folks looking into this - I ended up largely solving this issue by following the advice in the above link. I wrote a Python script in the notebook instance that trained for 4 epochs, saving the model each time I got a best-ever validation loss (using SaveModelCallback).

I activated the conda environment I wanted (source activate conda_pytorch_36), then executed that script with nohup, which ignores the hangup signal sent when you get disconnected from the notebook instance:
nohup python my_training_script.py

nohup will automatically save stdout, including the training performance metrics, to a text file named nohup.out.
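
For anyone who wants the same behaviour without fastai: the heart of SaveModelCallback is just tracking the best validation loss and checkpointing on improvement. Here's a minimal framework-agnostic sketch — the `validate` and `save_checkpoint` callables are placeholders for whatever your framework provides:

```python
# Minimal sketch of the "save on best validation loss" pattern that
# SaveModelCallback automates. `validate` and `save_checkpoint` are
# stand-ins for your real training and checkpointing code.

def run_training(epochs, validate, save_checkpoint):
    """Run `epochs` rounds; checkpoint whenever validation loss improves.

    validate(epoch)        -> validation loss for that epoch
    save_checkpoint(epoch) -> persists the current model weights
    """
    best_loss = float("inf")
    history = []
    for epoch in range(epochs):
        val_loss = validate(epoch)
        history.append(val_loss)
        # Per-epoch output; under nohup this lands in nohup.out,
        # so you can `tail -f nohup.out` to watch progress.
        print(f"epoch {epoch}: valid_loss={val_loss:.4f}")
        if val_loss < best_loss:
            best_loss = val_loss
            save_checkpoint(epoch)
    return best_loss, history
```

Wiring in your actual train/validate step and a `torch.save` call gives you the same best-model checkpoints the callback produces.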

Dear All,

I followed this great tutorial

and in Ohio I had to increase the limit to 1:
“The account-level service limit ‘ml.p2.xlarge for notebook instance usage’ is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit. (Service: AmazonSageMaker; Status Code: 400; Error Code: ResourceLimitExceeded; Request ID: b26d76fc-d6ac-4b8c-b104-ba31b226cf7e)”

Waiting for a reply at the moment. Has anyone else experienced the same issue?

I’ve forgotten to turn off the machine several times, so I decided to automate shutting it down when inactive. It turns out Jupyter exposes a nice API, so you can find out the last time it ran something.

To get the same behaviour, just add this script to your lifecycle configuration.

If you don’t know how to do that, I’ve written a guide.

By the way, if you don’t want your machine to turn off after 1 hour of inactivity, increase the IDLE_TIME variable (it’s in seconds!).
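
If you’re curious how that works under the hood: the Jupyter notebook server’s REST API (GET /api/kernels) reports a last_activity timestamp for each kernel, and the script compares that to the current time. A rough sketch of just the idle check, operating on the parsed JSON response:

```python
from datetime import datetime, timezone

# Sketch of the idle check an auto-shutdown script performs.
# `kernels` is the parsed JSON from the notebook server's
# GET /api/kernels endpoint, which includes a `last_activity`
# ISO-8601 timestamp per kernel.

IDLE_TIME = 3600  # seconds of inactivity before shutting down

def is_idle(kernels, now, idle_time=IDLE_TIME):
    """Return True if every kernel has been inactive for `idle_time` seconds."""
    if not kernels:
        return True  # nothing running at all
    for k in kernels:
        last = datetime.strptime(k["last_activity"], "%Y-%m-%dT%H:%M:%S.%fZ")
        last = last.replace(tzinfo=timezone.utc)
        if (now - last).total_seconds() < idle_time:
            return False  # at least one kernel was recently active
    return True
```

When `is_idle` comes back True, the real script calls the EC2/SageMaker API to stop the notebook instance.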

Having trouble downloading files in the sagemaker notebooks. Did anyone else face this issue?

Posted this on stackoverflow as well.

Yes, it is a known issue and the AWS team is looking into it.

@matt.mcclean Is there any update on Elastic Inference and fastai?

I haven’t had a chance to try it out. In theory it should work as you only need to have a PyTorch model that compiles to TorchScript. See the guide here: https://docs.aws.amazon.com/elastic-inference/latest/developerguide/ei-pytorch.html
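
For anyone who hasn’t used TorchScript: you’d pull the plain PyTorch module out of the fastai Learner (learn.model in fastai v1) and compile it with torch.jit.trace. Here’s a sketch with a toy module standing in for the real model — an illustration of the mechanism, not a tested fastai recipe:

```python
import torch
import torch.nn as nn

# Toy stand-in for the PyTorch model you'd pull out of a fastai
# Learner (e.g. learn.model). Real models should be in eval mode
# and traced with a representative example input.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

example_input = torch.randn(1, 4)

# torch.jit.trace records the ops executed on the example input and
# compiles them into a TorchScript module that can run without Python.
traced = torch.jit.trace(model, example_input)
traced.save("model_traced.pt")  # the artifact you'd hand to Elastic Inference

# Sanity check: the traced module matches the eager model on new input.
x = torch.randn(3, 4)
assert torch.allclose(model(x), traced(x), atol=1e-6)
```

One caveat: trace only records the ops executed for the example input, so models with data-dependent control flow need torch.jit.script instead.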

@matt.mcclean But I think compiling a fastai model to TorchScript is not possible yet?

I have a SageMaker work account with restricted permissions for dealing with stacks. Are the on-create and on-start lifecycle scripts available anywhere? I could create the notebook instance with those.

@maivel I’m in a similar situation where I have most permissions enabled but am working out of a shared account. I’m trying to get the lifecycle configuration set up to run and am having errors. I was able to get the script content out of the CloudFormation scripts referenced in the latest guide (https://course.fast.ai/start_sagemaker.html). Now I’m encountering an error on startup where my instance fails to start.

@matt.mcclean do you have any insight into what might be causing the errors? I’m fairly sure I copied the scripts correctly from CloudFormation:

On notebook start:

#!/bin/bash

sudo -H -i -u ec2-user bash << EOF
echo "Creating symlinks"
[ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
[ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

echo "Install a new kernel for fastai with name 'Python 3'"
source /home/ec2-user/anaconda3/bin/activate pytorch_p36
python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user

# uncomment if you want to update PyTorch on every start
#echo "Update PyTorch library"
#conda install -y pytorch torchvision -c pytorch

echo "Update fastai library"
conda install -y fastai -c fastai

echo "Install jupyter nbextension"
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

echo "Restarting jupyter notebook server"
pkill -f jupyter-notebook

echo "Getting latest version of fastai course"
cd /home/ec2-user/SageMaker/course-v3
git pull

echo "Finished running onStart script"
EOF

On create:

#!/bin/bash

sudo -H -i -u ec2-user bash << EOF

# create symlinks to EBS volume
echo "Creating symlinks"
mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

# clone the course notebooks
echo "Clone the course repo"
git clone https://github.com/fastai/course-v3.git /home/ec2-user/SageMaker/course-v3

echo "Finished running onCreate script"
EOF

I’ve read through the script content and understand pretty much all of it (I think), and am not sure what could be causing the error. I don’t have sufficient access currently to see CloudWatch data for the account. I was able to run the outdated fast.ai lifecycle configs in my account, though.

I’m itching to get rolling in SageMaker and am hoping I can get this running soon!

I’m having the same issue as @ehrene. My instance was working fine last week, and now I’m getting the same error regarding the Lifecycle configuration.

It seems to get stuck installing the most recent fast.ai library. The CloudWatch logs for the notebook instance get stuck in this loop, and then the creation times out at 5 minutes and rolls back the entire stack.

I’ve tried deleting and rebuilding the stack a few times now and keep getting the same issue. Was there an update to the library that might be causing this @matt.mcclean ?

It looks like the issue I’m having, @adaley222, is with the latest package as well. I have done the configuration manually by running the script commands in the terminal, and it’s the following line that either never completes or takes 20+ minutes. It has gotten hung up on ‘resolving environment’ and never progressed, but it has also sometimes worked. Frustrating, but at least you can use a terminal session in SageMaker to do the config necessary to get rolling.

conda install -y fastai -c fastai

I’ve been using SageMaker for quite a while now, and installing requirements with conda sometimes takes 20-30 minutes. I’m not sure why, because sometimes it manages to install them in under 5 minutes as well. The SageMaker start procedure times out and fails if the on-start lifecycle script does not finish in under 5 minutes. See here.

Fix
Run the EOF section under nohup so it can keep going in the background after the lifecycle script returns. This is my customized on-start script:

#!/bin/bash

echo "Disabling timeout"
nohup sudo -b -u ec2-user -i << EOF
echo "Creating symlinks"
[ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
[ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

echo "Install a new kernel for fastai with name 'Python 3'"
source /home/ec2-user/anaconda3/bin/activate pytorch_p36
python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user

# uncomment if you want to update PyTorch on every start
#echo "Update PyTorch library"
#conda install -y pytorch torchvision -c pytorch

echo "Update fastai library"
conda install -y fastai -c fastai

echo "Install jupyter nbextension"
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

echo "Restarting jupyter notebook server"
pkill -f jupyter-notebook

echo "Getting latest version of fastai course"
cd /home/ec2-user/SageMaker/course-v3
git pull

echo "Finished running onStart script"
EOF

@maivel, thank you. I tried the nohup (‘no hang-up’) approach in my existing script and that seemed to help. I’m not sure of the differences in the sudo line you shared, but I included that in my script as well. This seems to have solved the issue, and it was much appreciated!

@matt.mcclean or anyone who can help me out. I have been stuck for hours now.

I requested a service limit increase to use an ml.p2.xlarge or ml.p3.2xlarge instance, and my region is Middle East (Bahrain).
Does anyone have a launch stack configuration file for the Middle East (Bahrain) region?
Or shall I choose another region? How will that affect me if I proceed with another region?