Productionizing models thread

(Jeremy Howard (Admin)) #1

Here’s a general thread for discussion on putting models in production.

In general, our suggestion is to do inference on CPU where possible.

28 Likes

(Henri Palacci) #2

Super interested in this - was playing around with the new PyTorch CPP features. If anyone’s interested, here’s how you can compile your model from lesson 1:

learn.load('stage-1')                  # or whatever your saved model is
example = torch.rand(2, 3, 224, 224)   # dummy batch: batch size, n_channels, h, w
learn.model.training = False           # disable dropout
learn.model = learn.model.cpu()        # move to CPU
traced_script_module = torch.jit.trace(learn.model, example)
traced_script_module.save("model.pt")

This saves the model.pt file that you can then use in C++ code to build a binary, as shown in the PyTorch C++ documentation. You’ll get warnings due to dropout being in the model, but it should still work.
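Before moving to the C++ side, a quick sanity check (a sketch, not part of the original steps) is to load the traced module back in Python with torch.jit.load and run a dummy batch through it:

import torch

loaded = torch.jit.load("model.pt")
dummy = torch.rand(1, 3, 224, 224)   # one image, 3 channels, 224x224
with torch.no_grad():
    out = loaded(dummy)
print(out.shape)                     # should be (1, n_classes) for the lesson 1 model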

As a side note, the example worked for me out of the box without needing to use Xcode, Visual Studio, or some other C++ nightmare. Just remember to upgrade your cmake using
conda install -c anaconda cmake

I still have a ton of questions about this though:

  • now that you have your compiled model, what are the best practices? Call your binary from whatever backend you’re using?
  • data pre-processing: do it from within the C++ binary using the PyTorch data APIs, or move it to your backend? What are the pros and cons?
10 Likes

(Jeremy Howard (Admin)) #3

I’d guess that for web apps best practice would be to keep it in Python - I’m assuming that the C++ approach is just for embedded type stuff, where you’d link in your lib?

3 Likes

(Francesco Gianferrari Pini) #4

Ciao to all.
I agree with @jeremy. I would concentrate at first on:

  • Robust serialization of models trained on GPU
  • Robust deserialization of models on CPU, and publishing to artifact repositories or other locations
  • Methods for exposing APIs
  • Best practices for performance monitoring
  • Dockerization of the model for various target technologies (Kubernetes and/or serverless environments)

I would focus, for now, on Python and on fastai v1. It’s a long shot!

1 Like

(Simon Willison) #5

I figured out a rough and ready way to ship a simple model as an API.

https://cougar-or-not.now.sh/classify-url?url=https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Bobcat2.jpg/1200px-Bobcat2.jpg

The source code and Dockerfile are here: https://github.com/simonw/cougar-or-not

I’ve deployed it using inexpensive Docker based hosting from https://zeit.co/now - it’s a very simple Python API server built using https://www.starlette.io/
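For a sense of the shape of the code, a classify-url endpoint in Starlette looks roughly like the sketch below. The real implementation is in the repo above; the helper names and learner setup here are illustrative guesses, not the exact code.

import aiohttp
from io import BytesIO
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from fastai.vision import open_image

app = Starlette()
learner = None   # placeholder: build the fastai Learner and load the weights at startup

async def get_bytes(url):
    # download the image at the given URL into memory
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()

@app.route("/classify-url")
async def classify_url(request):
    img_bytes = await get_bytes(request.query_params["url"])
    img = open_image(BytesIO(img_bytes))
    pred_class, pred_idx, losses = learner.predict(img)
    return JSONResponse({"prediction": str(pred_class)})

The Dockerfile then only needs to install the CPU-only dependencies and run the app under an ASGI server such as uvicorn.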

10 Likes

(Ilia) #6

Is it possible to use synchronous frameworks (like Flask) with the Now platform?

0 Likes

(Henri Palacci) #7

Yes, leaving it in Python for webapps makes a ton of sense.

That said, a couple of points in favor of the C++ binaries:

  • lightweight: the C++ binary is 264KB and the model file is 86MB; by comparison, the official PyTorch Docker image is 1GB and @simonw’s Docker image is 1.7GB.
  • it’s a binary, so should simplify your backend container and ops problems somewhat? Maybe?
  • it avoids interop problems when you’re not using a Python backend (which tends to happen quite a bit :slight_smile:)
1 Like

(Simon Willison) #8

Yes, definitely. The beauty of Now is that it will run literally anything which can be built in a Docker container in a way that exposes a port. Basically any language that can run a web server works with it. Flask is absolutely fine.

1 Like

(Henri Palacci) #9

I’ve been meaning to give serverless a shot for a while, and Now seems super simple. Have you also tried Google Cloud Functions or AWS Lambda?

0 Likes

(WG) #10

What plan are you using and what are you projecting your monthly costs to be?

I’m looking at Lambda and ZEIT, and what I like about ZEIT is just being able to deploy my dockerized Flask app as is. We’ll see how it goes, but right now it looks much more straightforward than trying to deploy things to AWS.

0 Likes

(Simon Willison) #11

No one has figured out how to run ASGI / async apps in Python on Lambda yet - everyone expects you to use WSGI there. I’m sure a WSGI/Flask version of my script would run there just fine. I’m hoping someone solves ASGI on Lambda soon.

The limits at https://docs.aws.amazon.com/lambda/latest/dg/limits.html are worth considering - it looks like the initial deployment package can only be 50MB, but you can then have it download up to 200MB of extra stuff from S3 - so you could deploy the API server and have it download the model when it starts up.
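A rough sketch of that “download the model when it starts up” pattern (the bucket name, key, and handler below are placeholders, not a tested deployment):

import os
import boto3

MODEL_BUCKET = "my-model-bucket"   # placeholder: wherever the weights live
MODEL_KEY = "stage-1.pth"          # placeholder: exported fastai weights
LOCAL_PATH = "/tmp/stage-1.pth"    # /tmp is the only writable directory on Lambda

def ensure_model():
    # download only on a cold start; warm invocations reuse the cached file
    if not os.path.exists(LOCAL_PATH):
        boto3.client("s3").download_file(MODEL_BUCKET, MODEL_KEY, LOCAL_PATH)
    return LOCAL_PATH

def handler(event, context):
    model_path = ensure_model()
    # ... build the learner from model_path and run inference here ...
    return {"statusCode": 200, "body": "ok"}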

2 Likes

(Simon Willison) #12

I used to run a ton of projects on Zeit for a flat $15/month - that should be enough for this kind of API: https://zeit.co/account/plan

I actually upgraded to $50/month recently because I have so many other projects running there.

0 Likes

(WG) #13

I was wondering if the FREE tier would work for something like this, at least for QA? But I’m unsure whether the size restrictions will get in the way.

0 Likes

(Jeremy Howard (Admin)) #14

@simonw how about https://www.pythonanywhere.com/ ? They provide a shell, so you should be able to install fastai there, AFAICT. And it’s free. I haven’t tried it yet.

7 Likes

(William Horton) #15

I agree the C++ binary will always be more lightweight, but you can definitely deploy a fastai/Pytorch model with an image much smaller than 1.7GB. I deployed a Flask app serving a fastai model to a Heroku free dyno for a company hackathon and the two important steps were:

  • Copy just the fastai code I needed to run inference with the model (I was using an RNN). This avoids pulling in unnecessary dependencies.
  • Use the CPU-only PyTorch wheel. Not a lot of people know this trick, but it really cuts down the size. It’ll show up if you go to the PyTorch site’s install wizard (https://pytorch.org/get-started/locally/) and choose Package=pip and CUDA=None. The install will look something like pip3 install http://download.pytorch.org/whl/cpu/torch-0.4.1-cp36-cp36m-linux_x86_64.whl, or you can put the link in a requirements.txt (a sketch of one is below).

I don’t remember what the exact limit was but I got it down to somewhere like 256MB or 512MB.
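For reference, a hypothetical requirements.txt along those lines - the wheel URL is the one from the install wizard above; any other dependencies go in as usual:

# CPU-only PyTorch wheel (from the pytorch.org install wizard, pip + CUDA=None)
http://download.pytorch.org/whl/cpu/torch-0.4.1-cp36-cp36m-linux_x86_64.whl
flask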

8 Likes

(Jeremy Howard (Admin)) #16

OK I’ve now got a single-image prediction API that I think doesn’t suck too much.

Here’s the notebook:

The API is currently only in master. Let me know if you try it out - both successes and failures! :slight_smile:
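Roughly, the pattern is something like this (a sketch, not copied from the notebook; exact names and signatures may differ, and the path/classes below are placeholders):

from fastai import *
from fastai.vision import *

path = Path('.')                      # single_from_classes needs this to know where the saved model lives
classes = ['cougar', 'not_cougar']    # placeholder: the class list from training

# build an "empty" DataBunch that only knows the classes, transforms and size
data = ImageDataBunch.single_from_classes(
    path, classes, ds_tfms=get_transforms(), size=224).normalize(imagenet_stats)

learn = create_cnn(data, models.resnet34)
learn.load('stage-1')                 # weights trained on the GPU earlier

img = open_image('test.jpg')
pred_class, pred_idx, probs = learn.predict(img)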

17 Likes

(Francesco Gianferrari Pini) #18

One question: why is path needed in single_from_classes? But maybe it is too late at night in Italy and I’m missing something…

1 Like

(Henri Palacci) #19

Thanks that’s super useful!

Apologies in advance for the naive question - but in the web app world, wouldn’t it be OK to have a synchronous API and just have it called asynchronously from the frontend? (I’m not on familiar terrain here…)

[EDIT] Thought about this a little more - for high-throughput workloads async alone wouldn’t cut it, as you would probably need to batch requests using Redis (or similar) anyway? So you’d still be fine with a “semi-synchronous” backend with the Redis layer on top calling it?

0 Likes

#20

To know where the model is stored :wink:

1 Like

(Satish Kottapalli) #21

For a person whose only coding experience was in VBA in Excel, Access and Outlook, terms like Lambda, serverless, Docker, and Kubernetes are mighty scary. It brings back memories from when I first started deep learning. Now that Jeremy has made deep learning training uncool, I wish there was something similar for deployment/production.

Hope this topic gets a small mention sometime during the course. Challenges, best practices etc.

19 Likes