Ok. I think the notes version should be there too, and each student can decide which one they prefer.
I prefer the notes version so that I can understand what is happening and the reasoning behind it. I am taking the deep learning course for the first time, and the notes are very helpful while running the code.
Totally get your point. I think the easiest option then is to edit the CloudFormation template so it pulls from the fastbook repo.
If you have already created the SageMaker instance and the rest of the AWS stack, it is probably easiest to edit the script that is executed at notebook boot time.
Does this make sense?
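If you go the boot-script route, the edit can also be done from the CLI. A minimal sketch, assuming a placeholder config name <your-config-name> (look yours up with aws sagemaker list-notebook-instance-lifecycle-configs); the OnStart body below is illustrative, not the full original script:

```shell
# Sketch: update the OnStart script of an existing lifecycle config so it
# clones/pulls fastbook instead of course-v4. The snippet only prints the
# aws command so you can review it before running.
ON_START=$(cat << 'EOF'
#!/bin/bash
sudo -H -i -u ec2-user bash << INNER
cd /home/ec2-user/SageMaker/fastbook || git clone https://github.com/fastai/fastbook.git /home/ec2-user/SageMaker/fastbook
git -C /home/ec2-user/SageMaker/fastbook pull
INNER
EOF
)
# the API expects the script content base64-encoded, without line wraps
ENCODED=$(printf '%s' "$ON_START" | base64 | tr -d '\n')
echo "aws sagemaker update-notebook-instance-lifecycle-config \
  --notebook-instance-lifecycle-config-name <your-config-name> \
  --on-start Content=$ENCODED"
```

Stopping and restarting the notebook instance afterwards makes the new OnStart script run.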
Sure. This is the CloudFormation template @matt.mcclean shared, which can be used to create the fastai2 AWS stack.
If you look into it, you will notice that in a few key places there are git references to the course-v4 repo (such as git clone https://github.com/fastai/course-v4.git /home/ec2-user/SageMaker/course-v4).
Just replace the course-v4 repo with the fastbook repo and you should be good to go.
With the edited template, you can create a new AWS stack.
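For example (a sketch; cfn-template.yaml is a placeholder for wherever you saved the template, and the demo operates on a one-line excerpt):

```shell
# Demo of the swap on a one-line excerpt; run the same sed over your full template.
printf 'git clone https://github.com/fastai/course-v4.git /home/ec2-user/SageMaker/course-v4\n' > cfn-template.yaml
# replace every course-v4 reference with fastbook, keeping a .bak copy
sed -i.bak 's#course-v4#fastbook#g' cfn-template.yaml
cat cfn-template.yaml
# → git clone https://github.com/fastai/fastbook.git /home/ec2-user/SageMaker/fastbook
```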
I am getting a rollback message after replacing the git URL with the fastbook one. Is there anything specific I need to take care of besides replacing the URL?
2020-05-03 11:27:04 UTC+0530 fastai2
ROLLBACK_IN_PROGRESS The following resource(s) failed to create: [FastaiNotebookInstance]. Rollback requested by user.
2020-05-03 11:27:04 UTC+0530 FastaiNotebookInstance
CREATE_FAILED Notebook Instance Lifecycle Config "arn:aws:sagemaker:ap-south-1:247607369165:notebook-instance-lifecycle-config/fastainblifecycleconfig-iaz70odk3jdm" for Notebook Instance "arn:aws:sagemaker:ap-south-1:247607369165:notebook-instance/fastai2" took longer than 5 minutes. Please check your CloudWatch logs for more details if your Notebook Instance has Internet access
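That CREATE_FAILED message is SageMaker's hard limit: a lifecycle script must finish within 5 minutes or instance creation fails, and the conda/pip installs in the OnStart script can easily exceed that. A common workaround (a sketch; the paths and package list here are illustrative) is to push the slow installs into a background process so the lifecycle hook returns immediately:

```shell
# Write the slow part of the setup to a helper script...
cat > /tmp/slow-setup.sh << 'EOF'
#!/bin/bash
conda install -y -c fastai -c pytorch fastai   # long-running install
pip install fastai2 graphviz ipywidgets
EOF
chmod +x /tmp/slow-setup.sh
# ...and run it detached, so the lifecycle script exits well under 5 minutes
nohup /tmp/slow-setup.sh > /tmp/slow-setup.log 2>&1 &
echo "lifecycle script returns immediately; install continues in background"
```

The trade-off is that the notebook comes up before the installs finish, so check /tmp/slow-setup.log before running course code.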
Description: "Creates the SageMaker resources to run the fast.ai v4 (2020) course on a SageMaker notebook instance."
Parameters:
  NotebookName:
    Type: String
    Default: fastai2
    Description: Enter the name of the SageMaker notebook instance. Default is fastai2.
  InstanceType:
    Type: String
    Default: ml.p2.xlarge
    AllowedValues:
      - ml.p2.xlarge
      - ml.p3.2xlarge
    Description: Enter ml.p2.xlarge or ml.p3.2xlarge. Default is ml.p2.xlarge.
  VolumeSize:
    Type: Number
    Default: 50
    MinValue: 5
    MaxValue: 16384
    ConstraintDescription: Must be an integer between 5 (GB) and 16384 (16 TB).
    Description: Enter the size of the EBS volume in GB.
The script below worked for me.
It correctly created the AWS stack. I am able to launch the SageMaker notebook and, as you can see, the fastbook repo is in there.
Can you try with this one, please? Note: I am deploying in the eu-west-1 region.
Description: "Creates the SageMaker resources to run the fast.ai v4 (2020) course on a SageMaker notebook instance."
Parameters:
  NotebookName:
    Type: String
    Default: fastai2test
    Description: Enter the name of the SageMaker notebook instance. Default is fastai2test.
  InstanceType:
    Type: String
    Default: ml.p2.xlarge
    AllowedValues:
      - ml.p2.xlarge
      - ml.p3.2xlarge
    Description: Enter ml.p2.xlarge or ml.p3.2xlarge. Default is ml.p2.xlarge.
  VolumeSize:
    Type: Number
    Default: 50
    MinValue: 5
    MaxValue: 16384
    ConstraintDescription: Must be an integer between 5 (GB) and 16384 (16 TB).
    Description: Enter the size of the EBS volume in GB.
Resources:
  SageMakerIamRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
  FastaiNbLifecycleConfig:
    Type: "AWS::SageMaker::NotebookInstanceLifecycleConfig"
    Properties:
      OnCreate:
        - Content:
            Fn::Base64: |
              #!/bin/bash
              sudo -H -i -u ec2-user bash << EOF
              # create symlinks to EBS volume
              echo "Creating symlinks"
              mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
              # clone the course notebooks
              echo "Clone the course repo"
              git clone https://github.com/fastai/fastbook.git /home/ec2-user/SageMaker/fastbook
              echo "Finished running onCreate script"
              EOF
      OnStart:
        - Content:
            Fn::Base64: |
              #!/bin/bash
              sudo -H -i -u ec2-user bash << EOF
              echo "Creating symlinks"
              [ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              [ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
              echo "install fastai dependencies https://github.com/fastai/fastai2"
              conda install -y -c fastai -c pytorch fastai
              echo "Install a new kernel for fastai with name 'Python 3'"
              python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user
              # uncomment if you want to update PyTorch on every start
              #echo "Update PyTorch library"
              #conda install -y pytorch torchvision -c pytorch
              echo "Update fastai library"
              pip install fastai2
              echo "Install the deps for the course"
              pip install "fastai2>=0.0.11" graphviz ipywidgets matplotlib "nbdev>=0.2.12" pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece
              echo "Install jupyter nbextension"
              source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
              pip install jupyter_contrib_nbextensions
              jupyter contrib nbextension install --user
              echo "Restarting jupyter notebook server"
              pkill -f jupyter-notebook
              echo "Getting latest version of fastai course"
              cd /home/ec2-user/SageMaker/fastbook
              git pull
              echo "Finished running onStart script"
              EOF
  FastaiNotebookInstance:
    Type: "AWS::SageMaker::NotebookInstance"
    Properties:
      InstanceType: !Ref InstanceType
      LifecycleConfigName: !GetAtt FastaiNbLifecycleConfig.NotebookInstanceLifecycleConfigName
      NotebookInstanceName: !Ref NotebookName
      RoleArn: !GetAtt SageMakerIamRole.Arn
      VolumeSizeInGB: !Ref VolumeSize
I just uploaded the script and ran the CFN template on top.
Literally nothing else.
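In case it helps to reproduce this from the CLI, here is a hedged sketch of the equivalent create-stack call (stack name, file name, and parameter values are placeholders; CAPABILITY_IAM is required because the template creates an IAM role). The snippet prints the command rather than running it, so you can inspect it first:

```shell
# Build and print the CloudFormation deploy command; paste it to run.
CMD=$(cat << 'EOF'
aws cloudformation create-stack \
  --stack-name fastbook \
  --region eu-west-1 \
  --template-body file://cfn-template.yaml \
  --capabilities CAPABILITY_IAM \
  --parameters ParameterKey=NotebookName,ParameterValue=fastai2test \
               ParameterKey=InstanceType,ParameterValue=ml.p2.xlarge \
               ParameterKey=VolumeSize,ParameterValue=50
EOF
)
echo "$CMD"
```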
Can you try executing it in my EU region?
I'm not sure, but the region could be the problem.
I can't run it in eu-west-1 as I run into service limit issues, i.e. I would need to increase the available compute quota on my professional account, which I can't do.
Has anyone had luck using data parallel (DP) or distributed data parallel (DDP) with SageMaker instances or training jobs? I have tried using learner.parallel_ctx and learner.distributed_ctx on an ml.p3.8xlarge training instance (4x V100), but it trains at the same speed as a p3.2xlarge.
Based on reading this post, Distributed and parallel training... explained, it seems DDP won't work, since SageMaker kicks off 1 Python process when launching a training script. It also seems that with the p3.8xlarge, the GPUs are only available as parallel, not distributed:
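For reference, a rough sketch of the launch pattern DDP needs (one Python process per GPU). Here train.py is a hypothetical script that wraps learn.fit in fastai's distributed context manager, and the fastai.launch module path is an assumption (in the fastai2 pre-release the launcher lived under fastai2.launch):

```shell
# DDP needs one Python process per GPU; SageMaker's default entry point
# starts only one. fastai ships a small launcher that forks one process
# per visible GPU. "train.py" is a hypothetical training script.
NUM_GPUS=$(nvidia-smi --list-gpus 2>/dev/null | wc -l)
if [ "$NUM_GPUS" -gt 1 ]; then
  python -m fastai.launch train.py      # one process per GPU
else
  echo "multi-GPU launch skipped: found $NUM_GPUS GPU(s)"
fi
```

That matches the observation above: unless the entry script itself spawns per-GPU processes like this, a p3.8xlarge only helps via DP, not DDP.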