Productionizing models thread

(Jeremy Howard (Admin)) #1

Here’s a general thread for discussion on putting models in production.

In general, our suggestion is to do inference on CPU where possible.

28 Likes

(Henri Palacci) #2

Super interested in this - was playing around with the new PyTorch CPP features. If anyone’s interested, here’s how you can compile your model from lesson 1:

learn.load('stage-1')                  # or whatever your saved model is
example = torch.rand(2, 3, 224, 224)   # dummy batch: batch size, n_channels, h, w
learn.model.training = False           # disable dropout
learn.model = learn.model.cpu()        # move to CPU
traced_script_module = torch.jit.trace(learn.model, example)
traced_script_module.save("model.pt")

This saves the model.pt file that you can then use in C++ code to build a binary, as shown in the PyTorch C++ documentation. You’ll get warnings due to dropout being in the model, but it should still work.
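Before moving to the C++ side, a quick sanity check (a sketch, not part of the original steps) is to load the traced module back in Python with torch.jit.load and run a dummy batch through it:

import torch

loaded = torch.jit.load("model.pt")
dummy = torch.rand(1, 3, 224, 224)   # one image, 3 channels, 224x224
with torch.no_grad():
    out = loaded(dummy)
print(out.shape)                     # should be (1, n_classes) for the lesson 1 model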

As a side note, the example worked for me out of the box without needing to use Xcode, Visual Studio, or some other C++ nightmare. Just remember to upgrade your cmake using
conda install -c anaconda cmake

I still have a ton of questions about this though:

  • now that you have your compiled model, what are the best practices? Call your binary from whatever backend you’re using?
  • data pre-processing: do it from within the C++ binary using the PyTorch data APIs, or move it to your backend? What are the pros and cons?
10 Likes

(Jeremy Howard (Admin)) #3

I’d guess that for web apps best practice would be to keep it in Python - I’m assuming that the C++ approach is just for embedded type stuff, where you’d link in your lib?

3 Likes

(Francesco Gianferrari Pini) #4

Ciao to all.
I agree with @jeremy. I would concentrate at first on:

  • Robust serialization of models trained on GPU
  • Robust deserialization of models on CPU, and publishing to artifact repositories or other locations
  • Methods for exposing APIs
  • Best practices for performance monitoring
  • Dockerization of the model for various target technologies (Kubernetes and/or serverless environments)

I would focus, for now, on Python and on fastai v1. It’s a long shot!

1 Like

(Simon Willison) #5

I figured out a rough and ready way to ship a simple model as an API.

https://cougar-or-not.now.sh/classify-url?url=https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Bobcat2.jpg/1200px-Bobcat2.jpg

The source code and Dockerfile are here: https://github.com/simonw/cougar-or-not

I’ve deployed it using inexpensive Docker based hosting from https://zeit.co/now - it’s a very simple Python API server built using https://www.starlette.io/
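For a sense of the shape of the code, a classify-url endpoint in Starlette looks roughly like the sketch below. The real implementation is in the repo above; the helper names and learner setup here are illustrative guesses, not the exact code.

import aiohttp
from io import BytesIO
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from fastai.vision import open_image

app = Starlette()
learner = None   # placeholder: build the fastai Learner and load the weights at startup

async def get_bytes(url):
    # download the image at the given URL into memory
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()

@app.route("/classify-url")
async def classify_url(request):
    img_bytes = await get_bytes(request.query_params["url"])
    img = open_image(BytesIO(img_bytes))
    pred_class, pred_idx, losses = learner.predict(img)
    return JSONResponse({"prediction": str(pred_class)})

The Dockerfile then only needs to install the CPU-only dependencies and run the app under an ASGI server such as uvicorn.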

10 Likes

(Ilia) #6

Is it possible to use synchronous frameworks (like Flask) with the Now platform?

0 Likes

(Henri Palacci) #7

Yes, leaving it in Python for webapps makes a ton of sense.

That said, a couple of points in favor of the C++ binaries:

  • lightweight: the C++ binary is 264KB and the model file is 86MB; by comparison, the official PyTorch Docker image is 1GB and @simonw’s Docker image is 1.7GB.
  • it’s a binary, so should simplify your backend container and ops problems somewhat? Maybe?
  • it avoids interop problems when you’re not using a Python backend (which tends to happen quite a bit :slight_smile:)
1 Like

(Simon Willison) #8

Yes, definitely. The beauty of Now is that it will run literally anything which can be built in a Docker container in a way that exposes a port. Basically any language that can run a web server works with it. Flask is absolutely fine.

1 Like

(Henri Palacci) #9

I’ve been meaning to give serverless a shot for a while, and Now seems super simple. Have you also tried Google Cloud Functions or AWS Lambda?

0 Likes

(WG) #10

What plan are you using and what are you projecting your monthly costs to be?

I’m looking at Lambda and ZEIT, and what I like about ZEIT is just being able to deploy my dockerized Flask app as is. We’ll see how it goes, but right now it looks much more straightforward than trying to deploy things to AWS.

0 Likes

(Simon Willison) #11

No one has figured out how to run ASGI / async apps in Python on Lambda yet - everyone expects you to use WSGI there. I’m sure a WSGI/Flask version of my script would run there just fine. I’m hoping someone solves ASGI on Lambda soon.

The limits at https://docs.aws.amazon.com/lambda/latest/dg/limits.html are worth considering - it looks like the initial deployment package can only be 50MB, but you can then have it download up to 200MB of extra stuff from S3 - so you could deploy the API server and have it download the model when it starts up.
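A rough sketch of that “download the model when it starts up” pattern (the bucket name, key, and handler below are placeholders, not a tested deployment):

import os
import boto3

MODEL_BUCKET = "my-model-bucket"   # placeholder: wherever the weights live
MODEL_KEY = "stage-1.pth"          # placeholder: exported fastai weights
LOCAL_PATH = "/tmp/stage-1.pth"    # /tmp is the only writable directory on Lambda

def ensure_model():
    # download only on a cold start; warm invocations reuse the cached file
    if not os.path.exists(LOCAL_PATH):
        boto3.client("s3").download_file(MODEL_BUCKET, MODEL_KEY, LOCAL_PATH)
    return LOCAL_PATH

def handler(event, context):
    model_path = ensure_model()
    # ... build the learner from model_path and run inference here ...
    return {"statusCode": 200, "body": "ok"}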

2 Likes

(Simon Willison) #12

I used to run a ton of projects on Zeit for a flat $15/month - that should be enough for this kind of API: https://zeit.co/account/plan

I actually upgraded to $50/month recently because I have so many other projects running there.

0 Likes

(WG) #13

I was wondering if the FREE tier would work for something like this, at least for QA? But I’m unsure whether the size restrictions will get in the way.

0 Likes

(Jeremy Howard (Admin)) #14

@simonw how about https://www.pythonanywhere.com/ ? They provide a shell, so you should be able to install fastai there, AFAICT. And it’s free. I haven’t tried it yet.

7 Likes

(William Horton) #15

I agree the C++ binary will always be more lightweight, but you can definitely deploy a fastai/Pytorch model with an image much smaller than 1.7GB. I deployed a Flask app serving a fastai model to a Heroku free dyno for a company hackathon and the two important steps were:

  • Copy just the fastai code I needed to run inference with the model (I was using an RNN). This avoids pulling in unnecessary dependencies.
  • Use the CPU-only PyTorch wheel. Not a lot of people know this trick, but it really cuts down the size. It’ll show up if you go to the PyTorch site’s install wizard (https://pytorch.org/get-started/locally/) and choose Package=pip and CUDA=None. The install will look something like pip3 install http://download.pytorch.org/whl/cpu/torch-0.4.1-cp36-cp36m-linux_x86_64.whl, or you can put the link in a requirements.txt (a sketch of one is below).

I don’t remember what the exact limit was but I got it down to somewhere like 256MB or 512MB.
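For reference, a hypothetical requirements.txt along those lines - the wheel URL is the one from the install wizard above; any other dependencies go in as usual:

# CPU-only PyTorch wheel (from the pytorch.org install wizard, pip + CUDA=None)
http://download.pytorch.org/whl/cpu/torch-0.4.1-cp36-cp36m-linux_x86_64.whl
flask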

8 Likes

(Jeremy Howard (Admin)) #16

OK I’ve now got a single-image prediction API that I think doesn’t suck too much.

Here’s the notebook:

The API is currently only in master. Let me know if you try it out - both successes and failures! :slight_smile:
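Roughly, the pattern is something like this (a sketch, not copied from the notebook; exact names and signatures may differ, and the path/classes below are placeholders):

from fastai import *
from fastai.vision import *

path = Path('.')                      # single_from_classes needs this to know where the saved model lives
classes = ['cougar', 'not_cougar']    # placeholder: the class list from training

# build an "empty" DataBunch that only knows the classes, transforms and size
data = ImageDataBunch.single_from_classes(
    path, classes, ds_tfms=get_transforms(), size=224).normalize(imagenet_stats)

learn = create_cnn(data, models.resnet34)
learn.load('stage-1')                 # weights trained on the GPU earlier

img = open_image('test.jpg')
pred_class, pred_idx, probs = learn.predict(img)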

17 Likes

(Francesco Gianferrari Pini) #18

One question: why is path needed in single_from_classes? But maybe it is too late at night in Italy and I’m missing something…

1 Like

(Henri Palacci) #19

Thanks that’s super useful!

Apologies in advance for the naive question - but in the web app world, wouldn’t it be OK to have a synchronous API and just have it called asynchronously from the frontend? (I’m not on familiar terrain here…)

[EDIT] Thought about this a little more - for high-throughput workloads async alone wouldn’t cut it, as you would probably need to batch requests using Redis (or similar) anyway? So you’d still be fine with a “semi-synchronous” backend with the Redis layer on top calling it?

0 Likes

#20

To know where the model is stored :wink:

1 Like

(Satish Kottapalli) #21

For a person whose only coding experience was in VBA in Excel, Access and Outlook, terms like Lambda, serverless, Docker, and Kubernetes are mighty scary. It brings back memories from when I first started deep learning. Now that Jeremy has made deep learning training uncool, I wish there was something similar for deployment/production.

Hope this topic gets a small mention sometime during the course. Challenges, best practices etc.

19 Likes