Platform: SageMaker ✅

I'm having this issue as well.

I had this issue too, and I think I resolved it by running these steps from the “Start notebook” lifecycle configuration script directly in the terminal available from the notebook instance (the very bottom option in the New dropdown). The version I have has the PyTorch updates commented out, which is likely the source of the error:

source /home/ec2-user/anaconda3/bin/activate pytorch_p36
python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user
conda install -y pytorch torchvision -c pytorch
conda install -y fastai -c fastai
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
pkill -f jupyter-notebook

Oh my goodness, thank you so much. This fixed the issue!

Hello, I'm trying to deploy a fastai model on SageMaker and I'm seeing this error in CloudWatch:
sagemaker_containers._errors.InstallModuleError: InstallModuleError

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/gunicorn/workers/base_async.py", line 56, in handle
    self.handle_request(listener_name, req, client, addr)
  File "/usr/local/lib/python3.6/dist-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
    addr)
  File "/usr/local/lib/python3.6/dist-packages/gunicorn/workers/base_async.py", line 107, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_pytorch_container/serving.py", line 103, in main
    user_module = modules.import_module(serving_env.module_dir, serving_env.module_name)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_modules.py", line 244, in import_module
    install(_env.code_dir)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_modules.py", line 110, in install
    _process.check_error(shlex.split(cmd), _errors.InstallModuleError, cwd=path, capture_error=capture_error)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_process.py", line 48, in check_error
    raise error_class(return_code=return_code, cmd=' '.join(cmd), output=stderr)

I have a requirements.txt that installs fastai and librosa:

fastai
librosa
SoundFile
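
(For anyone else hitting this: per the traceback, InstallModuleError is raised when the pip install of your source directory — which pulls in requirements.txt — fails inside the serving container, and the underlying pip error is usually a few lines earlier in the CloudWatch log. One thing worth trying is pinning versions so pip doesn't resolve packages that are incompatible with the container's Python 3.6. The pins below are purely illustrative, not a known-good set — match them to what your container actually supports:)

```text
# Example of a pinned requirements.txt (illustrative versions only)
fastai==1.0.61
librosa==0.7.2
SoundFile==0.10.3.post1
```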

Hi - can I ask how you folks are running training overnight in SageMaker notebooks? I automatically get logged out, and that suspends the training. According to this comment from Amazon that's the expected behavior, but I feel like I've seen Jeremy and others show the results of overnight training runs.

I could follow the instructions in that comment and execute a python script with nohup (I’m already using SaveModelCallback), but I’d prefer to see the output of each epoch like you can in a notebook.

Thanks

For future folks looking into this - I ended up largely solving this issue by following the advice in the above link. I wrote a Python script in the notebook instance that trained for 4 epochs, saving the model each time I got a best-ever validation loss (using SaveModelCallback).

I activated the conda environment I wanted (source activate conda_pytorch_36), then executed that script with nohup, which ignores the hangup signal sent when you get disconnected from the notebook instance:
nohup python my_training_script.py

nohup will automatically save stdout, including the training performance metrics, to a text file named nohup.out.
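
For anyone who wants the same behaviour without fastai: the heart of SaveModelCallback is just tracking the best validation loss and checkpointing on improvement. Here's a minimal framework-agnostic sketch — the `validate` and `save_checkpoint` callables are placeholders for whatever your framework provides:

```python
# Minimal sketch of the "save on best validation loss" pattern that
# SaveModelCallback automates. `validate` and `save_checkpoint` are
# stand-ins for your real training and checkpointing code.

def run_training(epochs, validate, save_checkpoint):
    """Run `epochs` rounds; checkpoint whenever validation loss improves.

    validate(epoch)        -> validation loss for that epoch
    save_checkpoint(epoch) -> persists the current model weights
    """
    best_loss = float("inf")
    history = []
    for epoch in range(epochs):
        val_loss = validate(epoch)
        history.append(val_loss)
        # Per-epoch output; under nohup this lands in nohup.out,
        # so you can `tail -f nohup.out` to watch progress.
        print(f"epoch {epoch}: valid_loss={val_loss:.4f}")
        if val_loss < best_loss:
            best_loss = val_loss
            save_checkpoint(epoch)
    return best_loss, history
```

Wiring in your actual train/validate step and a `torch.save` call gives you the same best-model checkpoints the callback produces.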

Dear All,

I followed this great tutorial

and in Ohio I had to increase the limit to 1:
“The account-level service limit ‘ml.p2.xlarge for notebook instance usage’ is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit. (Service: AmazonSageMaker; Status Code: 400; Error Code: ResourceLimitExceeded; Request ID: b26d76fc-d6ac-4b8c-b104-ba31b226cf7e)”

Waiting for a reply at the moment. Has anyone else experienced the same issue?

I’ve forgotten to turn off the machine several times, so I decided to automate shutting it down when inactive. It turns out Jupyter exposes a nice API, so you can find out the last time it ran something.

To get the same behaviour, just add this script to your lifecycle configuration.

If you don’t know how to do that, I’ve written a guide.

By the way, if you don’t want your machine to turn off after 1 hour of inactivity, increase the IDLE_TIME variable (it’s in seconds!).
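
If you’re curious how that works under the hood: the Jupyter notebook server’s REST API (GET /api/kernels) reports a last_activity timestamp for each kernel, and the script compares that to the current time. A rough sketch of just the idle check, operating on the parsed JSON response:

```python
from datetime import datetime, timezone

# Sketch of the idle check an auto-shutdown script performs.
# `kernels` is the parsed JSON from the notebook server's
# GET /api/kernels endpoint, which includes a `last_activity`
# ISO-8601 timestamp per kernel.

IDLE_TIME = 3600  # seconds of inactivity before shutting down

def is_idle(kernels, now, idle_time=IDLE_TIME):
    """Return True if every kernel has been inactive for `idle_time` seconds."""
    if not kernels:
        return True  # nothing running at all
    for k in kernels:
        last = datetime.strptime(k["last_activity"], "%Y-%m-%dT%H:%M:%S.%fZ")
        last = last.replace(tzinfo=timezone.utc)
        if (now - last).total_seconds() < idle_time:
            return False  # at least one kernel was recently active
    return True
```

When `is_idle` comes back True, the real script calls the EC2/SageMaker API to stop the notebook instance.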

Having trouble downloading files in the sagemaker notebooks. Did anyone else face this issue?

Posted this on stackoverflow as well.

Yes, it is a known issue and the AWS team is looking into it.

@matt.mcclean Is there any update on Elastic Inference and fastai?

I haven’t had a chance to try it out. In theory it should work as you only need to have a PyTorch model that compiles to TorchScript. See the guide here: https://docs.aws.amazon.com/elastic-inference/latest/developerguide/ei-pytorch.html
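
For anyone who hasn’t used TorchScript: you’d pull the plain PyTorch module out of the fastai Learner (learn.model in fastai v1) and compile it with torch.jit.trace. Here’s a sketch with a toy module standing in for the real model — an illustration of the mechanism, not a tested fastai recipe:

```python
import torch
import torch.nn as nn

# Toy stand-in for the PyTorch model you'd pull out of a fastai
# Learner (e.g. learn.model). Real models should be in eval mode
# and traced with a representative example input.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

example_input = torch.randn(1, 4)

# torch.jit.trace records the ops executed on the example input and
# compiles them into a TorchScript module that can run without Python.
traced = torch.jit.trace(model, example_input)
traced.save("model_traced.pt")  # the artifact you'd hand to Elastic Inference

# Sanity check: the traced module matches the eager model on new input.
x = torch.randn(3, 4)
assert torch.allclose(model(x), traced(x), atol=1e-6)
```

One caveat: trace only records the ops executed for the example input, so models with data-dependent control flow need torch.jit.script instead.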

@matt.mcclean But I think compiling a fastai model to TorchScript is not possible yet?

I have a SageMaker work account with restricted permissions for dealing with stacks. Are the on-create and on-start lifecycle scripts available anywhere? I could create the notebook instance with those.

@maivel I’m in a similar situation where I have most permissions enabled but am working out of a shared account. I’m trying to get the lifecycle configuration set up to run and am having errors. I was able to get the script content out of the CloudFormation scripts referenced in the latest guide (https://course.fast.ai/start_sagemaker.html). Now I’m encountering an error on startup where my instance fails to start.

@matt.mcclean do you have any insight into what might be causing the errors? I’m fairly sure I copied the scripts correctly from CloudFormation:

On notebook start:

#!/bin/bash

sudo -H -i -u ec2-user bash << EOF
echo "Creating symlinks"
[ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
[ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

echo "Install a new kernel for fastai with name 'Python 3'"
source /home/ec2-user/anaconda3/bin/activate pytorch_p36
python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user

# uncomment if you want to update PyTorch on every start
#echo "Update PyTorch library"
#conda install -y pytorch torchvision -c pytorch

echo "Update fastai library"
conda install -y fastai -c fastai

echo "Install jupyter nbextension"
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

echo "Restarting jupyter notebook server"
pkill -f jupyter-notebook

echo "Getting latest version of fastai course"
cd /home/ec2-user/SageMaker/course-v3
git pull

echo "Finished running onStart script"
EOF

On create:

#!/bin/bash

sudo -H -i -u ec2-user bash << EOF

# create symlinks to EBS volume
echo "Creating symlinks"
mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

# clone the course notebooks
echo "Clone the course repo"
git clone https://github.com/fastai/course-v3.git /home/ec2-user/SageMaker/course-v3

echo "Finished running onCreate script"
EOF

I’ve read through the script content and understand pretty much all of it (I think), and am not sure what could be causing the error. I don’t have sufficient access currently to see CloudWatch data for the account. I was able to run the outdated fast.ai lifecycle configs in my account, though.

I’m itching to get rolling in SageMaker and am hoping I can get this running soon!

I’m having the same issue as @ehrene. My instance was working fine last week, and now I’m getting the same error regarding the Lifecycle configuration.

It seems to get stuck installing the most recent fast.ai library. The CloudWatch logs for the notebook instance get stuck in this loop, and then the creation times out at 5 minutes and rolls back the entire stack.

I’ve tried deleting and rebuilding the stack a few times now and keep getting the same issue. Was there an update to the library that might be causing this @matt.mcclean ?

It looks like the issue I’m having, @adaley222, is with the latest package as well. I have done the configuration manually by running the script commands in the terminal, and it’s the following line that either never completes or takes 20+ minutes. It has gotten hung up on ‘resolving environment’ and never progressed, but it has also sometimes worked. Frustrating, but at least you can use a terminal session in SageMaker to do the config necessary to get rolling.

conda install -y fastai -c fastai

I’ve been using SageMaker for quite a while now, and installing requirements with conda sometimes takes 20-30 minutes. I’m not sure why, because sometimes it manages to install them in under 5 minutes as well. The SageMaker start procedure times out and fails if the on-start lifecycle script does not finish in under 5 minutes. See here.

Fix
Run the EOF section under nohup so it can keep going in the background after the lifecycle script returns. This is my customized on-start script:

#!/bin/bash

echo "Disabling timeout"
nohup sudo -b -u ec2-user -i << EOF
echo "Creating symlinks"
[ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
[ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

echo "Install a new kernel for fastai with name 'Python 3'"
source /home/ec2-user/anaconda3/bin/activate pytorch_p36
python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user

# uncomment if you want to update PyTorch on every start
#echo "Update PyTorch library"
#conda install -y pytorch torchvision -c pytorch

echo "Update fastai library"
conda install -y fastai -c fastai

echo "Install jupyter nbextension"
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

echo "Restarting jupyter notebook server"
pkill -f jupyter-notebook

echo "Getting latest version of fastai course"
cd /home/ec2-user/SageMaker/course-v3
git pull

echo "Finished running onStart script"
EOF

@maivel, thank you. I tried the nohup (‘no hang-up’) approach in my existing script and that seemed to help. I’m not sure of the differences in the sudo line you shared, but I included that in my script as well. This seems to have solved the issue, and it was much appreciated!

@matt.mcclean or anyone who can help me out. I have been stuck for hours now.

I requested a service limit increase to use an ml.p2.xlarge or ml.p3.2xlarge instance, and my region is Middle East (Bahrain).
Does anyone have a launch stack configuration file for the Middle East (Bahrain) region?
Or shall I choose another region? How will that affect me if I proceed with another region?