Productionizing models thread

Here’s a general thread for discussion on putting models in production.

In general, our suggestion is to do inference on CPU where possible.


Super interested in this - was playing around with the new PyTorch CPP features. If anyone’s interested, here’s how you can compile your model from lesson 1:

learn.load('stage-1')  # or whatever your saved model is
learn.model.train(False)  # disable dropout
learn.model = learn.model.cpu()  # move to CPU
example = torch.rand(2, 3, 224, 224)  # dummy batch_size, n_channels, h, w
traced_script_module = torch.jit.trace(learn.model, example)
traced_script_module.save("model.pt")  # filename of your choice

This saves the file that you can use in C++ code to build a binary as shown in PyTorch C++ Documentation. You’ll get warnings due to dropout being in the model, but it should still work.
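Before handing the saved file to the C++ loader, it's worth sanity-checking it from Python first. A minimal self-contained sketch of the trace/save/reload round trip - note this uses a tiny stand-in model rather than the lesson 1 learner, and model.pt is an illustrative filename:

```python
import torch
import torch.nn as nn

# Tiny stand-in model (in the thread this would be learn.model from fastai)
model = nn.Sequential(
    nn.Conv2d(3, 8, 3),       # conv over the 3 input channels
    nn.AdaptiveAvgPool2d(1),  # pool down to 1x1
    nn.Flatten(),             # -> (batch, 8)
    nn.Linear(8, 2),          # two fake classes
)
model.eval()  # disable dropout before tracing

example = torch.rand(2, 3, 224, 224)  # dummy batch of images
traced = torch.jit.trace(model, example)
traced.save("model.pt")

# Reload the serialized module and confirm outputs match before shipping to C++
reloaded = torch.jit.load("model.pt")
with torch.no_grad():
    assert torch.allclose(traced(example), reloaded(example))
```

If the reloaded module's outputs match, the same file should load identically via `torch::jit::load` on the C++ side.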

As a side note, the example worked for me out of the box without needing to use XCode or Visual Studio or some other C++ nightmare. Just remember to upgrade your cmake using
conda install -c anaconda cmake

I still have a ton of questions about this though:

  • now you have your compiled model - what are best practices? Call your binary from whatever backend you’re using?
  • data pre-processing: do it from within the C++ binary using PyTorch data or move it to your backend - pros and cons?

I’d guess that for web apps best practice would be to keep it in Python - I’m assuming that the C++ approach is just for embedded type stuff, where you’d link in your lib?


Ciao to all.
I agree with @jeremy. I would concentrate at first on:

  • Robust serialization of models trained on GPU
  • Robust deserialization of models on CPU and publishing to artifact repositories or other locations
  • Methods for exposing APIs
  • Best practices for performance monitoring
  • Dockerization of the model for various technologies (Kubernetes and/or serverless environments)

I would focus, for now, on Python and on fastai v1. It’s a long shot!

1 Like

I figured out a rough and ready way to ship a simple model as an API.

The source code and Dockerfile are here:

I’ve deployed it using inexpensive Docker-based hosting from - it’s a very simple Python API server built using


Is it possible to use synchronous frameworks (like Flask) with the Now platform?

Yes, leaving it in Python for webapps makes a ton of sense.

That said a couple points in favor of the C++ binaries:

  • lightweight: the C++ binary is 264K and the model file is 86M; by comparison, the official PyTorch Docker image is 1GB and @simonw’s Docker image is 1.7GB.
  • it’s a binary, so should simplify your backend container and ops problems somewhat? Maybe?
  • same interop problems when you’re not using a Python backend (which tends to happen quite a bit :slight_smile:)

Yes, definitely. The beauty of Now is that it will run literally anything which can be built in a Docker container in a way that exposes a port. Basically any language that can run a web server works with it. Flask is absolutely fine.
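For the "keep it in Python" route, a Flask server like the ones discussed here fits in a few lines. Everything below is illustrative - the route, the form field name, and the predict stub are assumptions, not @simonw's actual code:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict(data: bytes) -> str:
    # Stand-in for real inference, e.g. learn.predict(...) on the decoded image
    return "cat"

@app.route("/classify", methods=["POST"])
def classify():
    img_bytes = request.files["file"].read()  # image uploaded as multipart form data
    return jsonify({"prediction": predict(img_bytes)})

if __name__ == "__main__":
    # Bind to 0.0.0.0 so the port is reachable from outside the Docker container
    app.run(host="0.0.0.0", port=8000)
```

Since Now just needs a Docker container exposing a port, pointing the Dockerfile's CMD at this script is all the integration there is.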

1 Like

I’ve been meaning to give serverless a shot for a while and Now seems super simple. Have you also tried Google Cloud Functions or AWS Lambda?

What plan are you using and what are you projecting your monthly costs to be?

Looking at Lambda and ZEIT, what I like about ZEIT is just being able to deploy my dockerized Flask app as is. Will see how it goes, but right now it looks so much more straightforward than trying to deploy things to AWS.

No one has figured out how to run ASGI / async apps in Python on Lambda yet - everyone expects you to use WSGI there. I’m sure a WSGI/Flask version of my script would run there just fine. I’m hoping someone solves ASGI on Lambda soon. are worth considering - it looks like Lambda can only be 50MB for the initial deployment package, but you can then have it download up to 200MB of extra stuff from S3 - so you could deploy the API server and have it download the model when it starts up.
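The download-on-startup idea can be sketched simply: fetch the model into /tmp on the first (cold) invocation and reuse it on warm ones. The bucket and key names here are illustrative, and boto3 is assumed available (as it is in the Lambda runtime):

```python
import os

MODEL_PATH = "/tmp/model.pt"  # /tmp is the only writable path on Lambda

def ensure_model(bucket="my-models", key="model.pt", path=MODEL_PATH):
    """Download the model once per container; warm invocations skip the fetch."""
    if not os.path.exists(path):
        import boto3  # imported lazily so local runs don't need AWS credentials
        boto3.client("s3").download_file(bucket, key, path)
    return path
```

The handler would call ensure_model() at the top of each invocation; only the first call per container pays the S3 download cost.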


I used to run a ton of projects on Zeit for a flat $15/month - that should be enough for this kind of API:

I actually upgraded to $50/month recently because I have so many other projects running there.

I was wondering if the free tier would work for something like this, at least for QA. But I’m unsure whether the size restrictions will get in the way?

@simonw how about ? They provide a shell, so should be able to install fastai there afaict. And it’s free. I haven’t tried it yet.


I agree the C++ binary will always be more lightweight, but you can definitely deploy a fastai/Pytorch model with an image much smaller than 1.7GB. I deployed a Flask app serving a fastai model to a Heroku free dyno for a company hackathon and the two important steps were:

  • Copy just the fastai code I needed to run inference with the model (I was using an RNN). This avoids pulling in unnecessary dependencies.
  • Use the CPU-only Pytorch wheel. Not a lot of people know this trick but it really cuts down the size. It’ll show up if you go to the Pytorch site’s install wizard and choose Package=pip and CUDA=None. The install will look something like pip3 install, or you can also put the link in a requirements.txt

I don’t remember what the exact limit was but I got it down to somewhere like 256MB or 512MB.


OK I’ve now got a single-image prediction API that I think doesn’t suck too much.

Here’s the notebook:

The API is currently only in master. Let me know if you try it out - both successes and failures! :slight_smile:


One question: why is path needed in single_from_classes? But maybe it is too late in the night in Italy and I’m missing something…

1 Like

Thanks that’s super useful!

Apologizing in advance for the naive question - but in the web app world wouldn’t it be ok to have a synchronous API and just have it be called asynchronously from the frontend? (I’m not in familiar terrain here …)

[EDIT] Thought about this a little more - for high throughput workloads just async wouldn’t cut it as you would probably need to batch requests using Redis (or other) anyway? So you’d still be fine with a “semi-synchronous” backend with the Redis layer on top calling it?
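The batching idea in that edit can be sketched without Redis: the core of it is just draining whatever requests are queued into one list before running the model on them as a batch. A toy in-process version, with queue.Queue standing in for the Redis layer:

```python
import queue

def batch_requests(q, max_batch=8):
    """Drain up to max_batch pending requests into one list for batched inference."""
    batch = [q.get()]  # block until at least one request arrives
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())  # grab whatever else is already waiting
        except queue.Empty:
            break
    return batch
```

A worker loop would call this repeatedly and run one forward pass per returned batch - the frontend stays async while the model-serving backend remains effectively synchronous.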

To know where the model is stored :wink:

1 Like

For a person whose only coding experience was in VBA in Excel, Access, and Outlook, terms like Lambda, Serverless, Docker, and Kubernetes are mighty scary. Brings back memories from when I first started deep learning. Now that Jeremy has made deep learning training uncool, I wish there was something similar for deployment/production.

Hope this topic gets a small mention sometime during the course. Challenges, best practices etc.