Deployment Platform: Amazon SageMaker

Hi all,

I have created a new deployment guide showing how to take your trained fastai model and deploy it to production using the Amazon SageMaker model hosting service.

Details outlined here: https://course.fast.ai/deployment_amzn_sagemaker.html

Feedback welcome!

9 Likes

Matt, these guides are really helpful, thanks for putting them together. I also like the AWS Lambda deployment guide and will be checking that out.

Do you have any tips for training on a large number of images (500 GB+) using SageMaker? I was wondering specifically if there is a way to create a fastai DataBunch directly from a list of files in S3, where the images are streamed in per batch (so you don't have to download 500 GB to your local volume)? I'm planning on investigating, but if you have any tips to jumpstart that effort I'd really appreciate it!

Ok, it looks like I found an answer to my question in this post from Julien Simon, which describes how he handled ImageNet. He suggests downloading to an EBS volume and then taking a snapshot of it.

I guess the problem with reading directly from S3 is that the I/O will be way too slow, so this seems like a good compromise between cost and performance.

SageMaker automatically downloads your training data from S3 to your instance's EBS volume; this is a key feature of the service. It means you don't need a DataBunch that reads from S3: you can simply configure your training script to read from the local EBS volume. Dataset size is rarely a blocker, as the largest EBS volume size gives you up to 16 TB of training data. The time it takes to download the data from S3 depends on the instance type and the EBS volume's I/O limits, as you can achieve up to 25 Gbps of data transfer between S3 and EC2.
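As a rough sketch of that flow (the bucket name, script name, and instance type below are placeholders, not taken from the guide), the estimator points at S3 and the training script just reads from the local directory that SageMaker populates:

import os
import sagemaker
from sagemaker.pytorch import PyTorch

# Notebook side: SageMaker copies s3://my-bucket/train-data down to the
# training instance's EBS volume before train.py starts running.
estimator = PyTorch(entry_point='train.py',              # hypothetical training script
                    role=sagemaker.get_execution_role(),
                    train_instance_count=1,
                    train_instance_type='ml.p3.2xlarge',
                    framework_version='1.0.0',
                    py_version='py3')
estimator.fit({'training': 's3://my-bucket/train-data'})

# Inside train.py: the 'training' channel is already on local disk, so fastai
# can build a DataBunch from it directly.
data_dir = os.environ.get('SM_CHANNEL_TRAINING', '/opt/ml/input/data/training')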

I suggest taking a look at this example notebook, which shows how to train a fastai model on SageMaker using the SageMaker SDK with PyTorch support: https://github.com/mattmcclean/sagemaker-fastai-examples/blob/master/lesson1/lesson1_local_train_deploy_sagemaker.ipynb

2 Likes

Hi Matt,

I am facing the below issue in CloudWatch (CW) while deploying the model to SageMaker:

Creating DataBunch object

[2019-03-11 22:58:57 +0000] [30] [ERROR] Error handling request /ping

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_functions.py", line 85, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/food.py", line 15, in model_fn
    empty_data = ImageDataBunch.load_empty(path)
  File "/usr/local/lib/python3.6/dist-packages/fastai/data_block.py", line 715, in _databunch_load_empty
    sd = LabelLists.load_empty(path, fn=fname)
  File "/usr/local/lib/python3.6/dist-packages/fastai/data_block.py", line 563, in load_empty
    return LabelLists.load_state(path, state)
  File "/usr/local/lib/python3.6/dist-packages/fastai/data_block.py", line 554, in load_state
    train_ds = LabelList.load_state(path, state)
  File "/usr/local/lib/python3.6/dist-packages/fastai/data_block.py", line 664, in load_state
    x = state['x_cls']([], path=path, processor=state['x_proc'], ignore_empty=True)
KeyError: 'x_cls'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/gunicorn/workers/base_async.py", line 56, in handle
    self.handle_request(listener_name, req, client, addr)
  File "/usr/local/lib/python3.6/dist-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
    addr)
  File "/usr/local/lib/python3.6/dist-packages/gunicorn/workers/base_async.py", line 107, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_pytorch_container/serving.py", line 107, in main
    user_module_transformer.initialize()
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_transformer.py", line 157, in initialize
    self._model = self._model_fn(_env.model_dir)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_functions.py", line 87, in wrapper
    six.reraise(error_class, error_class(e), sys.exc_info()[2])
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_functions.py", line 85, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/food.py", line 15, in model_fn
    empty_data = ImageDataBunch.load_empty(path)
  File "/usr/local/lib/python3.6/dist-packages/fastai/data_block.py", line 715, in _databunch_load_empty
    sd = LabelLists.load_empty(path, fn=fname)
  File "/usr/local/lib/python3.6/dist-packages/fastai/data_block.py", line 563, in load_empty
    return LabelLists.load_state(path, state)
10.32.0.2 - - [11/Mar/2019:22:58:57 +0000] "GET /ping HTTP/1.1" 500 141 "-" "AHC/2.0"
sagemaker_containers._errors.ClientError: 'x_cls'

I have found the issue:
If your notebook's fastai version is less than 1.0.48, you can follow the original guide's instructions to deploy the model in SageMaker. But if your fastai version is 1.0.48 or higher, you just need to export the export.pkl file and use a model_fn function inside serve.py like the one below, as a few function calls were updated in the new version. You also no longer need to create an empty DataBunch object.

import logging
from pathlib import Path
from fastai.basic_train import load_learner

logger = logging.getLogger(__name__)

# loads the model into memory from disk and returns it
def model_fn(model_dir):
    logger.info('model_fn')
    path = Path(model_dir)
    print('Creating learner object')
    learn = load_learner(path)
    return learn
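For reference, the training-side counterpart (assuming fastai 1.0.48+) is a one-liner:

learn.export()  # writes export.pkl into learn.path by default

Package export.pkl into the model artifact you upload to S3, and the model_fn above should pick it up.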

1 Like

Thanks @amit_aec_it. I have submitted a PR to fix this issue here.

Thanks @matt.mcclean

Hi @matt.mcclean, any chance you've figured out a way to incorporate SageMaker's Pipe Mode streaming? I'd like to try training on the COCO-Stuff dataset, and it's huge to download to the EC2 instance :frowning_face:

It is still not officially supported in the SageMaker SDK for PyTorch (see the issue here). You could consider implementing logic in your train() method to read data in Pipe mode yourself.

An example Python notebook with sample code can be found here: https://github.com/awslabs/amazon-sagemaker-examples/blob/80df7d61a4bf14a11f0442020e2003a7c1f78115/advanced_functionality/pipe_bring_your_own/pipe_bring_your_own.ipynb
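If you do implement it yourself, the underlying mechanic (which that notebook also demonstrates) is that Pipe mode exposes each channel as a sequence of FIFOs at /opt/ml/input/data/<channel>_<epoch>, one per pass over the data. A minimal reader sketch, with the channel name and chunk size being my own assumptions:

def stream_channel(channel='training', epoch=0, chunk_size=4096):
    # Pipe mode creates a fresh FIFO for every pass over the data
    fifo_path = f'/opt/ml/input/data/{channel}_{epoch}'
    with open(fifo_path, 'rb') as fifo:
        while True:
            chunk = fifo.read(chunk_size)
            if not chunk:
                break  # writer closed the pipe: end of this epoch's stream
            yield chunk

Your train() loop would then parse records out of these raw bytes rather than opening files from disk.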

1 Like

Hi Matt, awesome work!!
I think it could be useful to highlight this previous (and up-to-date) post
Running fast.ai notebooks with Amazon SageMaker
Thanks!!

@matt.mcclean Thank you for creating great documentation for deploying fastai models using SageMaker. I tried to deploy a fastai model (trained using a Jupyter notebook on an EC2 instance) as a SageMaker endpoint following the documentation, but the endpoint creation fails with the error attached below.

Does the documentation here also cover building a Docker container with the fastai libraries and pushing it to ECR? If not, is the documentation only for models trained in a SageMaker notebook instance, and not EC2 notebooks?

Hey @matt.mcclean, thank you for the documentation! I was able to solve @arul_bharathi's problem by replacing class with cls, but after that the deployment doesn't work because the load_learner function could not be found. Can you help resolve this issue? I am using the exact code you provided in your documentation.

algo-1-k8580_1  | Successfully built serve
algo-1-k8580_1  | Installing collected packages: serve
algo-1-k8580_1  | Successfully installed serve-1.0.0
algo-1-k8580_1  | You are using pip version 18.1, however version 19.2.1 is available.
algo-1-k8580_1  | You should consider upgrading via the 'pip install --upgrade pip' command.
algo-1-k8580_1  | [2019-07-25 07:47:45 +0000] [28] [ERROR] Error handling request /ping
algo-1-k8580_1  | Traceback (most recent call last):
algo-1-k8580_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_functions.py", line 85, in wrapper
algo-1-k8580_1  |     return fn(*args, **kwargs)
algo-1-k8580_1  |   File "/usr/local/lib/python3.6/dist-packages/serve.py", line 14, in model_fn
algo-1-k8580_1  |     learn = load_learner(model_dir, fname='resnet50.pkl')
algo-1-k8580_1  | NameError: name 'load_learner' is not defined

1 Like

Here is the blog post, just published today, which talks about building fastai models with Amazon SageMaker.

Thank you for your answer @amit_aec_it!
I followed the instructions exactly. However, when trying to run the first cell, I already get an error. How is this possible?

ContextualVersionConflict: (requests 2.22.0 (/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages), Requirement.parse('requests<2.21,>=2.20.0'), {'sagemaker'})
1 Like

Hi Fabian. Sorry for the inconvenience. AWS just fixed the issue. Could you please delete your old CloudFormation stack, which will also delete your SageMaker notebook instance? After deletion, please run the steps as described in the blog. Let us know if that works.

Thank you for pointing me to the official instructions from AWS!
I was now able to get it to work out of the box :slight_smile:

One question though:
In their pets.py file they save the model via learn.save(model_path/f'{args.model_arch}') instead of exporting the model for inference.

When loading the model, they take a rather tedious approach, like so:

# create an empty DataBunch from the exported data state
empty_data = ImageDataBunch.load_empty(path)
# recover the architecture name (e.g. 'resnet50') from the saved .pth filename
arch_name = os.path.splitext(os.path.split(glob.glob(f'{model_dir}/resnet*.pth')[0])[1])[0]
print(f'Model architecture is: {arch_name}')
# look up the matching fastai model constructor and load the saved weights into it
arch = getattr(models, arch_name)
learn = create_cnn(empty_data, arch, pretrained=False).load(path/f'{arch_name}')

Why are they not using learn.export(), and load_learner() for inference?

I was able to identify that the default deployment container in SageMaker uses fastai v1.0.39. This version of fastai does not yet have the load_learner() function implemented.
Do you know how to set specific package versions through the SageMaker Python SDK?
Best regards

1 Like

You are right, Fabian. The load_learner function comes with 1.0.48. Once AWS has pinned 1.0.48 or higher, you can use load_learner(). If you are asking how to use a specific version higher than 1.0.39 today, then you have to follow the BYOC (bring your own container) approach, with your desired fastai version installed in the container.
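To make the BYOC route concrete, here is a minimal sketch of pointing the SageMaker Python SDK at your own ECR image (the S3 path and image URI are placeholders, and I am assuming the v1 SDK's image parameter on PyTorchModel):

import sagemaker
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data='s3://my-bucket/model/model.tar.gz',  # placeholder S3 path
    role=sagemaker.get_execution_role(),
    entry_point='serve.py',
    # your custom ECR image with fastai >= 1.0.48 baked in (placeholder URI)
    image='123456789012.dkr.ecr.us-east-1.amazonaws.com/fastai-serving:latest',
)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.large')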

1 Like

@amit_aec_it If you need to BYOC, I would like to recommend BentoML.

BentoML is an open-source Python framework for serving and operating machine learning models, making it easy to promote trained models into high-performance prediction services.

After you spec out your machine learning service, it takes one command to deploy to SageMaker.

You can check out the fastai example notebook here: https://colab.research.google.com/github/bentoml/gallery/blob/master/fast-ai/pet-classification/notebook.ipynb

And you can check out an example notebook for deploying to SageMaker here: https://github.com/bentoml/BentoML/tree/master/examples/deploy-with-sagemaker

Let me know what you think. I'd love to get your feedback!

Cheers

Bo

2 Likes