Platform: Amazon SageMaker - AWS

OK. I think the notes version should be available too, and the student can decide which one they prefer.

I prefer the notes version so that I can understand what is happening and the reasoning behind it. I am doing the deep learning course for the first time, and the notes are very helpful while running the code.

I would request that both be made available.

Thanks
Ganesh

Totally get your point. I think the easiest option then is to edit the CloudFormation template so it clones and pulls from the fastbook repo.
If you have already created the SageMaker instance and the rest of the AWS stack, the easiest option is probably to edit the script executed at notebook boot time.
Does this make sense?

Need some clarity please. Where do I find these scripts?

Sure.
This is the CloudFormation template @matt.mcclean shared, which can be used to create the fastai2 AWS stack.

If you look into it, you will notice that in a few key places there are git references to the course-v4 repo (such as git clone https://github.com/fastai/course-v4.git /home/ec2-user/SageMaker/course-v4).
Just replace the course-v4 repo with the fastbook repo and you should be good to go.
With the edited template, you can create a new AWS stack.
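
For example, a minimal sketch of that edit from the command line (assuming GNU sed and that the template has been saved locally as template.yaml, a placeholder name):

    # swap every course-v4 reference in the template for the fastbook repo
    sed -i \
        -e 's#https://github.com/fastai/course-v4.git#https://github.com/fastai/fastbook.git#g' \
        -e 's#SageMaker/course-v4#SageMaker/fastbook#g' \
        template.yaml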


I am getting a rollback message after replacing the git URL with fastbook. Are there any specific things I need to take care of other than replacing the URL?

"2020-05-03 11:27:04 UTC+0530 fastai2
ROLLBACK_IN_PROGRESS The following resource(s) failed to create: [FastaiNotebookInstance]. Rollback requested by user.
2020-05-03 11:27:04 UTC+0530 FastaiNotebookInstance
CREATE_FAILED Notebook Instance Lifecycle Config 'arn:aws:sagemaker:ap-south-1:247607369165:notebook-instance-lifecycle-config/fastainblifecycleconfig-iaz70odk3jdm' for Notebook Instance 'arn:aws:sagemaker:ap-south-1:247607369165:notebook-instance/fastai2' took longer than 5 minutes. Please check your CloudWatch logs for more details if your Notebook Instance has Internet access"

Can you please check your CloudWatch logs for more details around the issue?


It should be the OnCreate log stream in your case, since the resource failed to create altogether.
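
In case it helps, a minimal sketch of pulling those logs with the AWS CLI (assuming the notebook instance name fastai2 from the error above; lifecycle config output goes to the /aws/sagemaker/NotebookInstances log group):

    # list the lifecycle log streams for the instance
    aws logs describe-log-streams \
        --log-group-name /aws/sagemaker/NotebookInstances \
        --log-stream-name-prefix fastai2

    # dump the OnCreate script output
    aws logs get-log-events \
        --log-group-name /aws/sagemaker/NotebookInstances \
        --log-stream-name fastai2/LifecycleConfigOnCreate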

I only found this error in LifecycleConfigOnStart:
ERROR: fastai 1.0.61 requires nvidia-ml-py3, which is not installed.

I added the pip install below for the missing library in the yml, but the stack is still not getting created.

echo "Update fastai library"
pip install fastai2
pip install nvidia-ml-py3

Yes, it is possible.
Look here.
I tried it myself, deploying a plain PyTorch model on Lambda (here).


Can you share your exact CloudFormation script?
I will try replicating your issue.


Description: "Creates the SageMaker resources to run the fast.ai v4 (2020) course on a SageMaker notebook instance."

Parameters:
  NotebookName:
    Type: String
    Default: fastai2
    Description: Enter the name of the SageMaker notebook instance. Default is fastai2.

  InstanceType:
    Type: String
    Default: ml.p2.xlarge
    AllowedValues:
      - ml.p2.xlarge
      - ml.p3.2xlarge
    Description: Enter ml.p2.xlarge or ml.p3.2xlarge. Default is ml.p2.xlarge.

  VolumeSize:
    Type: Number
    Default: 50
    MinValue: 5
    MaxValue: 16384
    ConstraintDescription: Must be an integer between 5 (GB) and 16384 (16 TB).
    Description: Enter the size of the EBS volume in GB.

Resources:
  SageMakerIamRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"

  FastaiNbLifecycleConfig:
    Type: "AWS::SageMaker::NotebookInstanceLifecycleConfig"
    Properties:
      OnCreate:
        - Content:
            Fn::Base64: |
              #!/bin/bash
              sudo -H -i -u ec2-user bash << EOF
              # create symlinks to EBS volume
              echo "Creating symlinks"
              mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

              # clone the course notebooks
              echo "Clone the course repo"
              git clone https://github.com/fastai/fastbook.git /home/ec2-user/SageMaker/course-v4

              echo "Finished running onCreate script"
              EOF

      OnStart:
        - Content:
            Fn::Base64: |
              #!/bin/bash

              sudo -H -i -u ec2-user bash << EOF
              echo "Creating symlinks"
              [ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              [ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

              echo "install fastai dependencies https://github.com/fastai/fastai2"
              conda install -y -c fastai -c pytorch fastai

              echo "Install a new kernel for fastai with name 'Python 3'"
              python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user

              # uncomment if you want to update PyTorch on every start
              #echo "Update PyTorch library"
              #conda install -y pytorch torchvision -c pytorch

              echo "Update fastai library"
              nohup pip install fastai2
              nohup pip install nvidia-ml-py3

              echo "Install the deps for the course"
              nohup pip install fastai2>=0.0.11 graphviz ipywidgets matplotlib nbdev>=0.2.12 pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece

              echo "Install jupyter nbextension"
              source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
              nohup pip install jupyter_contrib_nbextensions
              nohup jupyter contrib nbextensions install --user

              echo "Restarting jupyter notebook server"
              pkill -f jupyter-notebook

              echo "Getting latest version of fastai course"
              cd /home/ec2-user/SageMaker/course-v4
              git pull

              echo "Finished running onStart script"
              EOF

  FastaiNotebookInstance:
    Type: "AWS::SageMaker::NotebookInstance"
    Properties:
      InstanceType: !Ref InstanceType
      LifecycleConfigName: !GetAtt FastaiNbLifecycleConfig.NotebookInstanceLifecycleConfigName
      NotebookInstanceName: !Ref NotebookName
      RoleArn: !GetAtt SageMakerIamRole.Arn
      VolumeSizeInGB: !Ref VolumeSize

I have added "nohup" wherever there is a pip install command, based on what I read on AWS.

If the SageMaker lifecycle script runs beyond 5 minutes, AWS terminates it and CloudFormation rolls the stack back. The only way I found to stop the rollback is adding nohup.

The template is a copy of the course-v4 GitHub template, except that I have changed the GitHub URL to fastbook.
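
One hedged aside on the nohup approach: nohup on its own only makes the command ignore the hangup signal; it still runs in the foreground, so the lifecycle script still waits for it and can still hit the 5-minute limit. What lets the script return quickly is backgrounding the slow installs, for example (a sketch; the log path is a placeholder):

    echo "Update fastai library"
    # run the slow installs in the background and capture their output,
    # so the OnStart script itself finishes within the 5-minute limit
    nohup pip install fastai2 nvidia-ml-py3 > /home/ec2-user/SageMaker/pip-install.log 2>&1 &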

OK, I'll look into it.
I'll keep you posted.


The script below worked for me.
It correctly created the AWS stack. I am able to launch the SageMaker notebook, and the fastbook repo is cloned in there.
Can you try with this one, please?
Note: I am deploying in the eu-west-1 region.

Description: "Creates the SageMaker resources to run the fast.ai v4 (2020) course on a SageMaker notebook instance."

Parameters:
  NotebookName:
    Type: String
    Default: fastai2test
    Description: Enter the name of the SageMaker notebook instance. Default is fastai2test.

  InstanceType:
    Type: String
    Default: ml.p2.xlarge
    AllowedValues:
      - ml.p2.xlarge
      - ml.p3.2xlarge
    Description: Enter ml.p2.xlarge or ml.p3.2xlarge. Default is ml.p2.xlarge.

  VolumeSize:
    Type: Number
    Default: 50
    MinValue: 5
    MaxValue: 16384
    ConstraintDescription: Must be an integer between 5 (GB) and 16384 (16 TB).
    Description: Enter the size of the EBS volume in GB.

Resources:
  SageMakerIamRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
      
  FastaiNbLifecycleConfig:
    Type: "AWS::SageMaker::NotebookInstanceLifecycleConfig"
    Properties:
      OnCreate:
        - Content:
            Fn::Base64: |
              #!/bin/bash
              sudo -H -i -u ec2-user bash << EOF
              # create symlinks to EBS volume
              echo "Creating symlinks"
              mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

              # clone the course notebooks
              echo "Clone the course repo"
              git clone https://github.com/fastai/fastbook.git /home/ec2-user/SageMaker/fastbook

              echo "Finished running onCreate script"
              EOF
                
      OnStart:
        - Content:
            Fn::Base64: |
              #!/bin/bash

              sudo -H -i -u ec2-user bash << EOF
              echo "Creating symlinks"
              [ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
              [ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai

              echo "install fastai dependencies https://github.com/fastai/fastai2"
              conda install -y -c fastai -c pytorch fastai

              echo "Install a new kernel for fastai with name 'Python 3'"
              python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user

              # uncomment if you want to update PyTorch on every start
              #echo "Update PyTorch library"
              #conda install -y pytorch torchvision -c pytorch

              echo "Update fastai library"
              pip install fastai2

              echo "Install the deps for the course"
              pip install fastai2>=0.0.11 graphviz ipywidgets matplotlib nbdev>=0.2.12 pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece

              echo "Install jupyter nbextension"
              source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
              pip install jupyter_contrib_nbextensions
              jupyter contrib nbextensions install --user

              echo "Restarting jupyter notebook server"
              pkill -f jupyter-notebook

              echo "Getting latest version of fastai course"
              cd /home/ec2-user/SageMaker/fastbook
              git pull

              echo "Finished running onStart script"
              EOF

  FastaiNotebookInstance:
    Type: "AWS::SageMaker::NotebookInstance"
    Properties:
      InstanceType: !Ref InstanceType
      LifecycleConfigName: !GetAtt FastaiNbLifecycleConfig.NotebookInstanceLifecycleConfigName
      NotebookInstanceName: !Ref NotebookName
      RoleArn: !GetAtt SageMakerIamRole.Arn
      VolumeSizeInGB: !Ref VolumeSize
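
For completeness, one way to deploy the template from the command line (a sketch; fastai-sagemaker.yaml and the stack name are placeholders, and --capabilities CAPABILITY_IAM is required because the template creates an IAM role):

    aws cloudformation create-stack \
        --stack-name fastai2test \
        --template-body file://fastai-sagemaker.yaml \
        --capabilities CAPABILITY_IAM \
        --region eu-west-1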




It rolled back again with the script above. Did you use the default GPU and storage values?
I am running it in the Mumbai region (ap-south-1).

I just uploaded the script and ran the CFN template on top of it.
Literally nothing else.
Can you try executing it in my EU region (eu-west-1)?
I am not sure whether the region could be the problem.

I can't run it in eu-west-1 as I run into service account issues, i.e. I would need to increase the available compute quota on my professional account, which I can't do.

I see. I need to think further about how to fix this then.

I am getting the 5-minute timeout on the OnStart script.

The following packages are causing the inconsistency:

  • defaults/noarch::numpydoc==0.9.2=py_0
  • defaults/noarch::s3fs==0.4.0=py_0
  • defaults/linux-64::python-language-server==0.31.9=py37_0
  • defaults/linux-64::spyder==4.1.2=py37_0
  • conda-forge/noarch::sphinx==3.0.4=py_0

In the following lines, should fastai2 be changed to fastai?

          echo "Update fastai library"
          pip install fastai2

          echo "Install the deps for the course"
          pip install fastai2>=0.0.11 graphviz ipywidgets matplotlib nbdev>=0.2.12 pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece
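
One side note on those lines (an observation about the shell, not something from the template author): in a bash script an unquoted >= is parsed as an output redirection, so the version constraints fastai2>=0.0.11 and nbdev>=0.2.12 never actually reach pip. Quoting the requirement specifiers avoids that, for example:

          echo "Install the deps for the course"
          # quote the specifiers so ">=" is passed to pip instead of being treated as a redirect
          pip install "fastai2>=0.0.11" graphviz ipywidgets matplotlib "nbdev>=0.2.12" pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece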

Has anyone had luck using data parallel (DP) or distributed data parallel (DDP) with SageMaker instances or training jobs? I have tried using learner.parallel_ctx and learner.distributed_ctx on an ml.p3.8xlarge training instance (4x V100), but it trains at the same speed as a p3.2xlarge.

Based on reading this post, Distributed and parallel training... explained, it seems DDP won't work, since SageMaker kicks off a single Python process when launching a training script. It also seems that with the p3.8xlarge the GPUs are only available for parallel, not distributed, training:

rank_distrib() == 0
num_distrib() == 0
torch.cuda.device_count() == 4

Despite trying both the distributed and parallel contexts, I can't seem to get a speed-up from multiple GPUs.
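
For context on why the single Python process matters, a hedged sketch of how DDP is normally launched outside SageMaker (train.py is a placeholder training script): PyTorch's launcher starts one process per GPU, which a default SageMaker entry point does not do.

    # one worker process per GPU; each process gets its own rank from the launcher
    python -m torch.distributed.launch --nproc_per_node=4 train.py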

Edit: I made a post detailing what I've tried. I haven't had much luck with anything more than 1 GPU on SageMaker.