Totally get your point. I think the easiest then is to edit the CloudFormation template and git pull from the fastbook repo.
If you have already created the SageMaker instance and the full AWS stack, probably the easiest approach is to edit the script executed at notebook boot time.
Does this make sense?
Need some clarity please. Where do I find these scripts?
Sure.
This is the CloudFormation template @matt.mcclean shared, and that can be used to create the fastai2 AWS stack.
If you look into it, you will notice that in some key places there are git references to the course-v4 repo (such as git clone https://github.com/fastai/course-v4.git /home/ec2-user/SageMaker/course-v4). Just replace the course-v4 repo with the fastbook repo and you should be good to go.
With the edited template, you can create a new AWS stack.
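For example, a sed pass over a local copy of the template should do it (the filename cfn-fastai.yaml is an assumption, and the printf line just stands in for the real template so the commands are self-contained; check your template for any other course-v4 occurrences):

```shell
# For illustration only: a one-line stand-in for the downloaded template.
printf 'git clone https://github.com/fastai/course-v4.git /home/ec2-user/SageMaker/course-v4\n' > cfn-fastai.yaml

# Swap every course-v4 reference for fastbook, keeping a .bak copy.
sed -i.bak \
  -e 's#fastai/course-v4.git#fastai/fastbook.git#g' \
  -e 's#SageMaker/course-v4#SageMaker/fastbook#g' \
  cfn-fastai.yaml

cat cfn-fastai.yaml
```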
I am getting a rollback message after replacing the git URL for fastbook. Any specific things that I need to take care of other than replacing the URL?
"2020-05-03 11:27:04 UTC+0530 fastai2 ROLLBACK_IN_PROGRESS The following resource(s) failed to create: [FastaiNotebookInstance]. Rollback requested by user.
2020-05-03 11:27:04 UTC+0530 FastaiNotebookInstance CREATE_FAILED Notebook Instance Lifecycle Config 'arn:aws:sagemaker:ap-south-1:247607369165:notebook-instance-lifecycle-config/fastainblifecycleconfig-iaz70odk3jdm' for Notebook Instance 'arn:aws:sagemaker:ap-south-1:247607369165:notebook-instance/fastai2' took longer than 5 minutes. Please check your CloudWatch logs for more details if your Notebook Instance has Internet access"
Can you please check your CloudWatch logs for more details around the issue?
It should be the OnCreate logs batch in your case, since you failed to create the resource altogether.
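If it helps, the lifecycle logs can also be pulled from the CLI. A sketch, assuming the AWS CLI is configured, using the standard /aws/sagemaker/NotebookInstances log group and the <notebook-name>/LifecycleConfigOnCreate stream naming (the instance name fastai2 and region ap-south-1 come from your error message):

```shell
# Sketch: list, then read, the OnCreate lifecycle log stream for the
# 'fastai2' notebook instance in ap-south-1. Wrapped in a function so
# nothing runs until you call fetch_lifecycle_logs yourself.
fetch_lifecycle_logs() {
  aws logs describe-log-streams \
      --log-group-name /aws/sagemaker/NotebookInstances \
      --log-stream-name-prefix "fastai2/LifecycleConfigOnCreate" \
      --region ap-south-1
  aws logs get-log-events \
      --log-group-name /aws/sagemaker/NotebookInstances \
      --log-stream-name "fastai2/LifecycleConfigOnCreate" \
      --region ap-south-1
}
```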
I only found this error in LifecycleConfigOnStart:
ERROR: fastai 1.0.61 requires nvidia-ml-py3, which is not installed.
I added the pip install below for that library in the YAML, but the stack is still not getting created.
echo "Update fastai library"
pip install fastai2
pip install nvidia-ml-py3
Can you share your exact CloudFormation script?
I will try replicating your issue.
Description: "Creates the SageMaker resources to run the fast.ai v4 (2020) course on a SageMaker notebook instance."
Parameters:
  NotebookName:
    Type: String
    Default: fastai2
    Description: Enter the name of the SageMaker notebook instance. Default is fastai2.
  InstanceType:
    Type: String
    Default: ml.p2.xlarge
    AllowedValues:
      - ml.p2.xlarge
      - ml.p3.2xlarge
    Description: Enter ml.p2.xlarge or ml.p3.2xlarge. Default is ml.p2.xlarge.
  VolumeSize:
    Type: Number
    Default: 50
    MinValue: 5
    MaxValue: 16384
    ConstraintDescription: Must be an integer between 5 (GB) and 16384 (16 TB).
    Description: Enter the size of the EBS volume in GB.
Resources:
  SageMakerIamRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
  FastaiNbLifecycleConfig:
    Type: "AWS::SageMaker::NotebookInstanceLifecycleConfig"
    Properties:
      OnCreate:
        - Content:
            Fn::Base64: |
              #!/bin/bash
              sudo -H -i -u ec2-user bash << EOF
              # create symlinks to EBS volume
              echo "Creating symlinks"
              mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
              # clone the course notebooks
              echo "Clone the course repo"
              git clone https://github.com/fastai/fastbook.git /home/ec2-user/SageMaker/course-v4
              echo "Finished running onCreate script"
              EOF
      OnStart:
        - Content:
            Fn::Base64: |
              #!/bin/bash
              sudo -H -i -u ec2-user bash << EOF
              echo "Creating symlinks"
              [ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              [ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
              echo "install fastai dependencies https://github.com/fastai/fastai2"
              conda install -y -c fastai -c pytorch fastai
              echo "Install a new kernel for fastai with name 'Python 3'"
              python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user
              # uncomment if you want to update PyTorch on every start
              #echo "Update PyTorch library"
              #conda install -y pytorch torchvision -c pytorch
              echo "Update fastai library"
              nohup pip install fastai2
              nohup pip install nvidia-ml-py3
              echo "Install the deps for the course"
              nohup pip install "fastai2>=0.0.11" graphviz ipywidgets matplotlib "nbdev>=0.2.12" pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece
              echo "Install jupyter nbextension"
              source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
              nohup pip install jupyter_contrib_nbextensions
              nohup jupyter contrib nbextension install --user
              echo "Restarting jupyter notebook server"
              pkill -f jupyter-notebook
              echo "Getting latest version of fastai course"
              cd /home/ec2-user/SageMaker/course-v4
              git pull
              echo "Finished running onStart script"
              EOF
  FastaiNotebookInstance:
    Type: "AWS::SageMaker::NotebookInstance"
    Properties:
      InstanceType: !Ref InstanceType
      LifecycleConfigName: !GetAtt FastaiNbLifecycleConfig.NotebookInstanceLifecycleConfigName
      NotebookInstanceName: !Ref NotebookName
      RoleArn: !GetAtt SageMakerIamRole.Arn
      VolumeSizeInGB: !Ref VolumeSize
I have added "nohup" wherever there is a pip install command, based on what I read on AWS.
If a SageMaker lifecycle configuration script runs beyond 5 minutes, AWS terminates it and rolls the stack back; the only way I found to avoid the rollback is adding nohup.
The template is a copy of the course-v4 GitHub code, except that I changed the GitHub URL to fastbook.
OK, I will look into it and keep you posted.
The below script worked for me.
It correctly created the AWS stack. I am able to launch the SageMaker notebook and, as you can see, the fastbook repo is in there.
Can you try with this one, please?
Note: I am deploying in the eu-west-1 region.
Description: "Creates the SageMaker resources to run the fast.ai v4 (2020) course on a SageMaker notebook instance."
Parameters:
  NotebookName:
    Type: String
    Default: fastai2test
    Description: Enter the name of the SageMaker notebook instance. Default is fastai2test.
  InstanceType:
    Type: String
    Default: ml.p2.xlarge
    AllowedValues:
      - ml.p2.xlarge
      - ml.p3.2xlarge
    Description: Enter ml.p2.xlarge or ml.p3.2xlarge. Default is ml.p2.xlarge.
  VolumeSize:
    Type: Number
    Default: 50
    MinValue: 5
    MaxValue: 16384
    ConstraintDescription: Must be an integer between 5 (GB) and 16384 (16 TB).
    Description: Enter the size of the EBS volume in GB.
Resources:
  SageMakerIamRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
  FastaiNbLifecycleConfig:
    Type: "AWS::SageMaker::NotebookInstanceLifecycleConfig"
    Properties:
      OnCreate:
        - Content:
            Fn::Base64: |
              #!/bin/bash
              sudo -H -i -u ec2-user bash << EOF
              # create symlinks to EBS volume
              echo "Creating symlinks"
              mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
              # clone the course notebooks
              echo "Clone the course repo"
              git clone https://github.com/fastai/fastbook.git /home/ec2-user/SageMaker/fastbook
              echo "Finished running onCreate script"
              EOF
      OnStart:
        - Content:
            Fn::Base64: |
              #!/bin/bash
              sudo -H -i -u ec2-user bash << EOF
              echo "Creating symlinks"
              [ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              [ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
              echo "install fastai dependencies https://github.com/fastai/fastai2"
              conda install -y -c fastai -c pytorch fastai
              echo "Install a new kernel for fastai with name 'Python 3'"
              python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user
              # uncomment if you want to update PyTorch on every start
              #echo "Update PyTorch library"
              #conda install -y pytorch torchvision -c pytorch
              echo "Update fastai library"
              pip install fastai2
              echo "Install the deps for the course"
              pip install "fastai2>=0.0.11" graphviz ipywidgets matplotlib "nbdev>=0.2.12" pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece
              echo "Install jupyter nbextension"
              source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
              pip install jupyter_contrib_nbextensions
              jupyter contrib nbextension install --user
              echo "Restarting jupyter notebook server"
              pkill -f jupyter-notebook
              echo "Getting latest version of fastai course"
              cd /home/ec2-user/SageMaker/fastbook
              git pull
              echo "Finished running onStart script"
              EOF
  FastaiNotebookInstance:
    Type: "AWS::SageMaker::NotebookInstance"
    Properties:
      InstanceType: !Ref InstanceType
      LifecycleConfigName: !GetAtt FastaiNbLifecycleConfig.NotebookInstanceLifecycleConfigName
      NotebookInstanceName: !Ref NotebookName
      RoleArn: !GetAtt SageMakerIamRole.Arn
      VolumeSizeInGB: !Ref VolumeSize
It rolled back again with the script above. Did you use the default GPU and storage values?
I am running it in the Mumbai (ap-south-1) region.
I just uploaded the script and ran the CFN template on top. Literally nothing else.
Can you try executing it in my EU region (eu-west-1)? I am not sure, but the region could be the problem.
I can't run it in eu-west-1 because I run into service quota issues, i.e. I would need to increase the available compute on my professional account, which I can't do.
I see. I need to think further about how to fix this, then.
I am getting a 5-minute timeout on the OnStart script.
The following packages are causing the inconsistency:
- defaults/noarch::numpydoc==0.9.2=py_0
- defaults/noarch::s3fs==0.4.0=py_0
- defaults/linux-64::python-language-server==0.31.9=py37_0
- defaults/linux-64::spyder==4.1.2=py37_0
- conda-forge/noarch::sphinx==3.0.4=py_0
In the following lines, should fastai2 be changed to fastai?
echo "Update fastai library"
pip install fastai2
echo "Install the deps for the course"
pip install fastai2>=0.0.11 graphviz ipywidgets matplotlib nbdev>=0.2.12 pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece
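Possibly, yes: fastai2 was the pre-release name, and its code was later released as fastai version 2.x, so the right spelling depends on which package the instance's environment actually carries. A quick, environment-agnostic probe (nothing is installed or changed; it only reports which name resolves):

```shell
python3 - <<'EOF'
import importlib.util

# Report which of the two package names resolves in this environment.
for name in ("fastai", "fastai2"):
    spec = importlib.util.find_spec(name)
    print(name, "->", "installed" if spec else "not installed")
EOF
```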
Has anyone had luck using data parallel (DP) or distributed data parallel (DDP) with SageMaker instances or training jobs? I have tried using learner.parallel_ctx and learner.distributed_ctx on an ml.p3.8xlarge training instance (4x V100), but it trains at the same speed as a p3.2xlarge.
Based on reading this post, Distributed and parallel training... explained, it seems DDP won't work, since SageMaker kicks off a single Python process when launching a training script. It also seems that with the p3.8xlarge the GPUs are only available as parallel, not distributed:
rank_distrib() == 0
num_distrib() == 0
torch.cuda.device_count() == 4
Despite trying both the distributed and parallel contexts, I can't seem to get a speedup from multiple GPUs.
Edit: I made a post detailing what I've tried. I haven't had much luck with anything beyond 1 GPU on SageMaker.
Is anyone else having problems importing fastbook? I'm getting stuck when using the fastai2 kernel.
#hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
This yields the following error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-7-2b820b2b946f> in <module>
1 #hide
2 get_ipython().system('pip install -Uqq fastbook')
----> 3 import fastbook
4 fastbook.setup_book()
ModuleNotFoundError: No module named 'fastbook'
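One hedged guess: on these instances, !pip can install into a different conda environment than the one backing the selected kernel, so fastbook lands where the kernel never looks. In a notebook cell, installing with the kernel's own interpreter sidesteps that: import sys, then !{sys.executable} -m pip install -Uqq fastbook. From a terminal on the instance, this prints which interpreter and site-packages are actually in play, which usually pinpoints the mismatch:

```shell
python3 - <<'EOF'
import sys, sysconfig

# If the kernel's sys.executable differs from the interpreter pip installs
# under, packages installed via '!pip' will be invisible to the notebook.
print("interpreter:  ", sys.executable)
print("site-packages:", sysconfig.get_paths()["purelib"])
EOF
```

Compare this output against sys.executable printed from inside the failing notebook kernel.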