Platform: Amazon SageMaker - AWS

OK. I think the notes version should be available too, and the student can decide which one they prefer.

I prefer the notes version so that I can understand what is happening and the reasoning behind it. I am doing the deep learning course for the first time, and the notes are very helpful while running the code.

I would request that both be made available.

Thanks
Ganesh

Totally get your point. I think the easiest option then is to edit the CloudFormation template so it clones and pulls from the fastbook repo.
If you have already created the SageMaker instance and the rest of the AWS stack, the easiest option is probably to edit the script executed at notebook boot time.
Does this make sense?

Need some clarity please. Where do I find these scripts?

Sure.
This is the CloudFormation template @matt.mcclean shared, which can be used to create the fastai2 AWS stack.

If you look into it, you will notice that in a few key places there are git references to the course-v4 repo (such as git clone https://github.com/fastai/course-v4.git /home/ec2-user/SageMaker/course-v4).
Just replace the course-v4 repo with the fastbook repo and you should be good to go.
With the edited template, you can create a new AWS stack.
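
For example, a minimal sketch of that edit from the command line (assuming GNU sed and that the template has been saved locally as template.yaml, a placeholder name):

    # swap every course-v4 reference in the template for the fastbook repo
    sed -i \
        -e 's#https://github.com/fastai/course-v4.git#https://github.com/fastai/fastbook.git#g' \
        -e 's#SageMaker/course-v4#SageMaker/fastbook#g' \
        template.yaml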


I am getting a rollback message after replacing the git URL with fastbook. Are there any specific things I need to take care of other than replacing the URL?

"2020-05-03 11:27:04 UTC+0530 fastai2
ROLLBACK_IN_PROGRESS The following resource(s) failed to create: [FastaiNotebookInstance]. Rollback requested by user.
2020-05-03 11:27:04 UTC+0530 FastaiNotebookInstance
CREATE_FAILED Notebook Instance Lifecycle Config 'arn:aws:sagemaker:ap-south-1:247607369165:notebook-instance-lifecycle-config/fastainblifecycleconfig-iaz70odk3jdm' for Notebook Instance 'arn:aws:sagemaker:ap-south-1:247607369165:notebook-instance/fastai2' took longer than 5 minutes. Please check your CloudWatch logs for more details if your Notebook Instance has Internet access"

Can you please check your CloudWatch logs for more details around the issue?


It should be the OnCreate log stream in your case, since the resource failed to create altogether.
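
In case it helps, a minimal sketch of pulling those logs with the AWS CLI (assuming the notebook instance name fastai2 from the error above; lifecycle config output goes to the /aws/sagemaker/NotebookInstances log group):

    # list the lifecycle log streams for the instance
    aws logs describe-log-streams \
        --log-group-name /aws/sagemaker/NotebookInstances \
        --log-stream-name-prefix fastai2

    # dump the OnCreate script output
    aws logs get-log-events \
        --log-group-name /aws/sagemaker/NotebookInstances \
        --log-stream-name fastai2/LifecycleConfigOnCreate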

I only found this error in LifecycleConfigOnStart:
ERROR: fastai 1.0.61 requires nvidia-ml-py3, which is not installed.

I added the pip install below for the missing library in the yml, but the stack is still not getting created.

echo "Update fastai library"
pip install fastai2
pip install nvidia-ml-py3

Yes, it is possible.
Look here.
I tried it myself, deploying a plain PyTorch model on Lambda (here).


Can you share your exact CloudFormation script?
I will try replicating your issue.


Description: "Creates the SageMaker resources to run the fast.ai v4 (2020) course on a SageMaker notebook instance."

Parameters:
  NotebookName:
    Type: String
    Default: fastai2
    Description: Enter the name of the SageMaker notebook instance. Default is fastai2.

  InstanceType:
    Type: String
    Default: ml.p2.xlarge
    AllowedValues:
      - ml.p2.xlarge
      - ml.p3.2xlarge
    Description: Enter ml.p2.xlarge or ml.p3.2xlarge. Default is ml.p2.xlarge.

  VolumeSize:
    Type: Number
    Default: 50
    MinValue: 5
    MaxValue: 16384
    ConstraintDescription: Must be an integer between 5 (GB) and 16384 (16 TB).
    Description: Enter the size of the EBS volume in GB.

Resources:
  SageMakerIamRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"

  FastaiNbLifecycleConfig:
    Type: "AWS::SageMaker::NotebookInstanceLifecycleConfig"
    Properties:
      OnCreate:
        - Content:
            Fn::Base64: |
              #!/bin/bash
              sudo -H -i -u ec2-user bash << EOF
              # create symlinks to EBS volume
              echo "Creating symlinks"
              mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

              # clone the course notebooks
              echo "Clone the course repo"
              git clone https://github.com/fastai/fastbook.git /home/ec2-user/SageMaker/course-v4

              echo "Finished running onCreate script"
              EOF

      OnStart:
        - Content:
            Fn::Base64: |
              #!/bin/bash

              sudo -H -i -u ec2-user bash << EOF
              echo "Creating symlinks"
              [ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              [ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

              echo "install fastai dependencies https://github.com/fastai/fastai2"
              conda install -y -c fastai -c pytorch fastai

              echo "Install a new kernel for fastai with name 'Python 3'"
              python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user

              # uncomment if you want to update PyTorch on every start
              #echo "Update PyTorch library"
              #conda install -y pytorch torchvision -c pytorch

              echo "Update fastai library"
              nohup pip install fastai2
              nohup pip install nvidia-ml-py3

              echo "Install the deps for the course"
              nohup pip install fastai2>=0.0.11 graphviz ipywidgets matplotlib nbdev>=0.2.12 pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece

              echo "Install jupyter nbextension"
              source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
              nohup pip install jupyter_contrib_nbextensions
              nohup jupyter contrib nbextensions install --user

              echo "Restarting jupyter notebook server"
              pkill -f jupyter-notebook

              echo "Getting latest version of fastai course"
              cd /home/ec2-user/SageMaker/course-v4
              git pull

              echo "Finished running onStart script"
              EOF

  FastaiNotebookInstance:
    Type: "AWS::SageMaker::NotebookInstance"
    Properties:
      InstanceType: !Ref InstanceType
      LifecycleConfigName: !GetAtt FastaiNbLifecycleConfig.NotebookInstanceLifecycleConfigName
      NotebookInstanceName: !Ref NotebookName
      RoleArn: !GetAtt SageMakerIamRole.Arn
      VolumeSizeInGB: !Ref VolumeSize

I have added "nohup" wherever there is a pip install command, based on what I read on AWS.

If the SageMaker lifecycle script runs beyond 5 minutes, AWS terminates it and CloudFormation rolls the stack back. The only way I found to stop the rollback is adding nohup.

The template is a copy of the course-v4 GitHub template, except that I have changed the GitHub URL to fastbook.
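
One hedged aside on the nohup approach: nohup on its own only makes the command ignore the hangup signal; it still runs in the foreground, so the lifecycle script still waits for it and can still hit the 5-minute limit. What lets the script return quickly is backgrounding the slow installs, for example (a sketch; the log path is a placeholder):

    echo "Update fastai library"
    # run the slow installs in the background and capture their output,
    # so the OnStart script itself finishes within the 5-minute limit
    nohup pip install fastai2 nvidia-ml-py3 > /home/ec2-user/SageMaker/pip-install.log 2>&1 &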

OK, I'll look into it.
I'll keep you posted.


The script below worked for me.
It correctly created the AWS stack. I am able to launch the SageMaker notebook, and the fastbook repo is cloned in there.
Can you try with this one, please?
Note: I am deploying in the eu-west-1 region.

Description: "Creates the SageMaker resources to run the fast.ai v4 (2020) course on a SageMaker notebook instance."

Parameters:
  NotebookName:
    Type: String
    Default: fastai2test
    Description: Enter the name of the SageMaker notebook instance. Default is fastai2test.

  InstanceType:
    Type: String
    Default: ml.p2.xlarge
    AllowedValues:
      - ml.p2.xlarge
      - ml.p3.2xlarge
    Description: Enter ml.p2.xlarge or ml.p3.2xlarge. Default is ml.p2.xlarge.

  VolumeSize:
    Type: Number
    Default: 50
    MinValue: 5
    MaxValue: 16384
    ConstraintDescription: Must be an integer between 5 (GB) and 16384 (16 TB).
    Description: Enter the size of the EBS volume in GB.

Resources:
  SageMakerIamRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
      
  FastaiNbLifecycleConfig:
    Type: "AWS::SageMaker::NotebookInstanceLifecycleConfig"
    Properties:
      OnCreate:
        - Content:
            Fn::Base64: |
              #!/bin/bash
              sudo -H -i -u ec2-user bash << EOF
              # create symlinks to EBS volume
              echo "Creating symlinks"
              mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

              # clone the course notebooks
              echo "Clone the course repo"
              git clone https://github.com/fastai/fastbook.git /home/ec2-user/SageMaker/fastbook

              echo "Finished running onCreate script"
              EOF
                
      OnStart:
        - Content:
            Fn::Base64: |
              #!/bin/bash

              sudo -H -i -u ec2-user bash << EOF
              echo "Creating symlinks"
              [ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              [ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

              echo "install fastai dependencies https://github.com/fastai/fastai2"
              conda install -y -c fastai -c pytorch fastai

              echo "Install a new kernel for fastai with name 'Python 3'"
              python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user

              # uncomment if you want to update PyTorch on every start
              #echo "Update PyTorch library"
              #conda install -y pytorch torchvision -c pytorch

              echo "Update fastai library"
              pip install fastai2

              echo "Install the deps for the course"
              pip install fastai2>=0.0.11 graphviz ipywidgets matplotlib nbdev>=0.2.12 pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece

              echo "Install jupyter nbextension"
              source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
              pip install jupyter_contrib_nbextensions
              jupyter contrib nbextensions install --user

              echo "Restarting jupyter notebook server"
              pkill -f jupyter-notebook

              echo "Getting latest version of fastai course"
              cd /home/ec2-user/SageMaker/fastbook
              git pull

              echo "Finished running onStart script"
              EOF

  FastaiNotebookInstance:
    Type: "AWS::SageMaker::NotebookInstance"
    Properties:
      InstanceType: !Ref InstanceType
      LifecycleConfigName: !GetAtt FastaiNbLifecycleConfig.NotebookInstanceLifecycleConfigName
      NotebookInstanceName: !Ref NotebookName
      RoleArn: !GetAtt SageMakerIamRole.Arn
      VolumeSizeInGB: !Ref VolumeSize
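
For completeness, one way to deploy the template from the command line (a sketch; fastai-sagemaker.yaml and the stack name are placeholders, and --capabilities CAPABILITY_IAM is required because the template creates an IAM role):

    aws cloudformation create-stack \
        --stack-name fastai2test \
        --template-body file://fastai-sagemaker.yaml \
        --capabilities CAPABILITY_IAM \
        --region eu-west-1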




It rolled back again with the script above. Did you use the default GPU and storage values?
I am running it in the Mumbai region (ap-south-1).

I just uploaded the script and ran the CFN template on top of it.
Literally nothing else.
Can you try executing it in my EU region (eu-west-1)?
I am not sure whether the region could be the problem.

I can't run it in eu-west-1 as I run into service account issues, i.e. I would need to increase the available compute quota on my professional account, which I can't do.

I see. I need to think further about how to fix this then.

I am getting the 5-minute timeout on the OnStart script.

The following packages are causing the inconsistency:

  • defaults/noarch::numpydoc==0.9.2=py_0
  • defaults/noarch::s3fs==0.4.0=py_0
  • defaults/linux-64::python-language-server==0.31.9=py37_0
  • defaults/linux-64::spyder==4.1.2=py37_0
  • conda-forge/noarch::sphinx==3.0.4=py_0

In the following lines, should fastai2 be changed to fastai?

          echo "Update fastai library"
          pip install fastai2

          echo "Install the deps for the course"
          pip install fastai2>=0.0.11 graphviz ipywidgets matplotlib nbdev>=0.2.12 pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece
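
One side note on those lines (an observation about the shell, not something from the template author): in a bash script an unquoted >= is parsed as an output redirection, so the version constraints fastai2>=0.0.11 and nbdev>=0.2.12 never actually reach pip. Quoting the requirement specifiers avoids that, for example:

          echo "Install the deps for the course"
          # quote the specifiers so ">=" is passed to pip instead of being treated as a redirect
          pip install "fastai2>=0.0.11" graphviz ipywidgets matplotlib "nbdev>=0.2.12" pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece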

Has anyone had luck using data parallel (DP) or distributed data parallel (DDP) with SageMaker instances or training jobs? I have tried using learner.parallel_ctx and learner.distributed_ctx on an ml.p3.8xlarge training instance (4x V100), but it trains at the same speed as a p3.2xlarge.

Based on reading this post, Distributed and parallel training... explained, it seems DDP won't work, since SageMaker kicks off a single Python process when launching a training script. It also seems that with the p3.8xlarge the GPUs are only available for parallel, not distributed, training:

rank_distrib() == 0
num_distrib() == 0
torch.cuda.device_count() == 4

Despite trying both the distributed and parallel contexts, I can't seem to get a speed-up from multiple GPUs.
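
For context on why the single Python process matters, a hedged sketch of how DDP is normally launched outside SageMaker (train.py is a placeholder training script): PyTorch's launcher starts one process per GPU, which a default SageMaker entry point does not do.

    # one worker process per GPU; each process gets its own rank from the launcher
    python -m torch.distributed.launch --nproc_per_node=4 train.py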

Edit: I made a post detailing what I've tried. I haven't had much luck with anything more than 1 GPU on SageMaker.