Platform: Amazon SageMaker - AWS

I just uploaded the script and ran the CFN template on top.
Literally nothing else.
Can you try executing it in my EU region?
Not sure whether that could be the problem.

Can't run it in eu-west-1 as I run into service account issues, i.e. I would need to increase the available compute quota on my professional account, which I can't do.

I see. I need to think further about how to fix this then.

I am getting a 5-minute timeout on the start script.

The following packages are causing the inconsistency:

  • defaults/noarch::numpydoc==0.9.2=py_0
  • defaults/noarch::s3fs==0.4.0=py_0
  • defaults/linux-64::python-language-server==0.31.9=py37_0
  • defaults/linux-64::spyder==4.1.2=py37_0
  • conda-forge/noarch::sphinx==3.0.4=py_0

In the following lines, should fastai2 be changed to fastai?

          echo "Update fastai library"
          pip install fastai2

          echo "Install the deps for the course"
          pip install fastai2>=0.0.11 graphviz ipywidgets matplotlib nbdev>=0.2.12 pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece
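If fastai2 was indeed folded back into the fastai package (as later posts in this thread suggest), the install could be sketched from Python so that the version pins are passed as explicit arguments rather than risking the shell treating >= as a redirect. This is an illustration only; the fastai>=2.0.0 pin is an assumption, not taken from the course script:

```python
import subprocess
import sys

# Hypothetical dependency list assuming the fastai2 -> fastai rename;
# the fastai version pin is a guess, not from the course script.
COURSE_DEPS = [
    "fastai>=2.0.0", "graphviz", "ipywidgets", "matplotlib",
    "nbdev>=0.2.12", "pandas", "scikit-learn",
    "azure-cognitiveservices-search-imagesearch", "sentencepiece",
]

def install_course_deps() -> int:
    # Passing specifiers as separate argv entries avoids the shell
    # redirection problem an unquoted ">=" causes in a bash start script.
    return subprocess.call([sys.executable, "-m", "pip", "install", *COURSE_DEPS])
```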

Anyone have luck using data parallel (dp) or distributed data parallel (ddp) with SageMaker instances or training jobs? I have tried using learner.parallel_ctx and learner.distributed_ctx on an ml.p3.8xlarge training instance (4x V100), but it is training at the same speed as a p3.2xlarge.

Based on this post, Distributed and parallel training... explained, it seems using ddp won't work since SageMaker will kick off one Python process when launching a training script. It also seems that with the p3.8xlarge, the GPUs are only available for parallel, not distributed, training:

rank_distrib() == 0
num_distrib() == 0
torch.cuda.device_count() == 4

Despite trying both the distributed and parallel contexts, I can't seem to get a speedup from multiple GPUs.

Edit - Made a post detailing what I’ve tried. Haven’t had much luck with anything more than 1 GPU with SageMaker.
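For what it's worth, a minimal sketch of the workaround that post implies: since SageMaker starts a single Python process, DDP needs the training script relaunched once per GPU with the usual rank environment variables set. This is a hand-rolled illustration, not fastai's or SageMaker's own launcher (the name launch_ddp_workers is made up):

```python
import os
import subprocess
import sys

def launch_ddp_workers(train_script: str, num_gpus: int) -> None:
    """Spawn one worker process per GPU with RANK/WORLD_SIZE set,
    mimicking what a torch.distributed-style launcher does."""
    procs = []
    for rank in range(num_gpus):
        env = dict(
            os.environ,
            RANK=str(rank),
            WORLD_SIZE=str(num_gpus),
            MASTER_ADDR="127.0.0.1",
            MASTER_PORT="29500",
        )
        procs.append(subprocess.Popen([sys.executable, train_script], env=env))
    for p in procs:
        p.wait()
```

With WORLD_SIZE set this way, fastai's num_distrib() should report the worker count instead of 0, matching the checks quoted above.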

Anyone having problems importing fastbook? I’m getting stuck when using the fastai2 kernel.

#hide
!pip install -Uqq fastbook 
import fastbook 
fastbook.setup_book()

This yields the following error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-7-2b820b2b946f> in <module>
  1 #hide
  2 get_ipython().system('pip install -Uqq fastbook')
----> 3 import fastbook
  4 fastbook.setup_book()

ModuleNotFoundError: No module named 'fastbook'
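One possible cause, and this is an assumption rather than a confirmed diagnosis: in a notebook, !pip can resolve to a different Python environment than the running kernel, so the install succeeds but the kernel still can't see the module. Installing through the kernel's own interpreter rules that out:

```python
import subprocess
import sys

def pip_install_cmd(package: str) -> list:
    # sys.executable is the interpreter backing this kernel, so the
    # package lands in the environment that `import` actually searches.
    return [sys.executable, "-m", "pip", "install", "-Uqq", package]

def pip_install(package: str) -> int:
    return subprocess.call(pip_install_cmd(package))

# pip_install("fastbook"); import fastbook  # then fastbook.setup_book()
```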

I get this exact same error when I try to build everything with the script provided by @FraPochetti.
Still, thanks a lot for the template. I tried adding some extra lines of code, like conda update --all, but that didn't get rid of the error message. I ended up just using one of the initially provided templates. Thank you all for your help!

@_Nils Could you point to the template you used? Thanks.

@FraPochetti I also get the 5-minute timeout error when launching the CloudFormation stack using the template you shared. Can you confirm this template still works?

@ganesh.bhat were you ever able to get this to work? Did adding 'nohup' to the pip commands fix the issue?

@matt.mcclean can you confirm this setup process still works?
I can report that the CloudFormation stack deploys with no issues, and I am able to access the SageMaker notebook instance/Jupyter. But the notebook instance it creates does not work with the fastai notebooks.

The very first notebook: https://github.com/fastai/fastbook/blob/master/01_intro.ipynb

cannot be completed without quite a few errors, all of which suggest graphviz is not configured correctly during the setup process in the CFN script.

Sorry guys, I may have time to look into this next week.
Right now it is a bit hectic.

You can create a stack using the template from here https://course.fast.ai/start_sagemaker, changing the two 'course-v4' references to 'fastbook':

pip install -r /home/ec2-user/SageMaker/fastbook/requirements.txt

DefaultCodeRepository: https://github.com/fastai/fastbook

Notebook 01 runs without errors.
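If it helps, the two edits can also be applied mechanically to a locally saved copy of the template before uploading it to CloudFormation. This is just a convenience sketch (the function name is made up):

```python
from pathlib import Path

def patch_template(path: str) -> int:
    """Replace every 'course-v4' reference with 'fastbook' in a saved
    CloudFormation template; returns how many references were changed."""
    p = Path(path)
    text = p.read_text()
    count = text.count("course-v4")
    p.write_text(text.replace("course-v4", "fastbook"))
    return count  # expect 2 for the start_sagemaker template
```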

This fixes it. Thanks @AlisonDavey!

For anyone else who finds this thread with similar problems: I will submit a PR with the fix.


PR to fix: https://github.com/fastai/course20/pull/28


I selected the Frankfurt, Germany link since I live in Northern Europe, and got this response when trying to create the stack.

CREATE_FAILED The requested resource notebook-instance/ml.p2.xlarge is not available in this region (Service: AmazonSageMaker; Status Code: 400; Error Code: ResourceLimitExceeded; Request ID: 12ba24b0-cf28-4e37-aec0-e04a199ef168)

Hmm, that instance type is definitely supposed to be available in the Frankfurt region. Check out the pricing page under the “On-Demand Notebook Instances” tab and Frankfurt region.

You might need to request a quota increase.

Yes, AWS is insane.

I tried Ireland instead, but ran into having to request a quota increase. It took hours to get a response, which only repeated my request back to me. By that time I was up and running with Paperspace, so I cancelled AWS. Not going back to AWS anytime soon.

Has anyone seen this error while trying to deploy the model?
sagemaker_containers._errors.ClientError: name 'load_learner' is not defined

The error is coming from:
File "/usr/local/lib/python3.6/dist-packages/inference.py", line 15, in model_fn
    learn = load_learner(model_dir, fname='export.pkl')
The file inference.py contains the four methods model_fn, input_fn, predict_fn and output_fn

and it imports:

import logging, requests, os, io, glob, time
from fastai.vision import *

I followed the instructions at this link: https://github.com/fastai/course-v3/blob/master/docs/deployment_amzn_sagemaker.md

Could it be the instance where the model is deployed does not have the right version of fastai?

This is my deploy instruction (from the link above):
predictor = model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')
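To test the version hypothesis above, here is a quick stdlib-only check that can be run inside inference.py (or locally against the serving image) to log which fastai the container actually has. The helper name is illustrative, not part of the SageMaker SDK:

```python
from importlib import metadata

def installed_version(package: str) -> str:
    """Return the installed version of `package`, or '' if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return ""

# e.g. inside model_fn: logging.info("fastai in container: %s",
#                                    installed_version("fastai"))
```

If the version the container reports differs from the one used to export export.pkl, pinning fastai in the deployment's requirements would be the next thing to try.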