Fastai / PyTorch in Production

One of the things that I haven’t seen discussed much on the forums is how to implement all these awesome models in production. After several hours of research, I landed on a microservice approach, using the Serverless Framework with AWS Lambda and loading the weights in from S3. Using some of the handy features in Serverless, you can get around the AWS Lambda deployment limitations. Also, if you set up the handler so that a decorator loads the model into memory, and keep your lambda warm, the model will persist between requests, so the weights only need to be loaded on the first request.

This method has been working well for me so far, is incredibly cheap to run, and is very easy to scale. So I wanted to share it for anyone who is interested, and I’m also curious to hear some of the other approaches people are using.
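As a rough illustration of the load-once idea described above, here is a minimal sketch with no AWS dependencies. The names `load_once` and `fake_loader` are made up for this example, and the loader is a stand-in for downloading weights from S3 and building the model. It loads lazily on the first call, which is one possible variant and not necessarily how the real handler does it.

```python
import functools

def load_once(loader):
    """Decorator factory: run `loader` a single time and reuse its result."""
    cache = {}

    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            # The cache outlives a single call, so a warm container skips the load.
            if "model" not in cache:
                cache["model"] = loader()
            return handler(event, context, cache["model"])
        return wrapper
    return decorator

load_calls = []

def fake_loader():
    # Hypothetical stand-in for "download weights from S3 and build the model".
    load_calls.append(1)
    return lambda x: x * 2

@load_once(fake_loader)
def handler(event, context, model):
    return {"statusCode": 200, "body": model(event["value"])}

first = handler({"value": 3}, None)
second = handler({"value": 5}, None)
```

Both calls go through the same cached model, so `fake_loader` only runs once for as long as the process stays alive.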

There is now a repo demonstrating this approach.


I’m curious to hear how you perform feature engineering and transform new data to a generator/vector before getting predictions.
In the case of categorical data I had to resort to keeping the original training datatable in memory. Otherwise I could not use apply_cats().

For a beta of the service I’m planning to use Flask and run it on an Azure VM, though the entire development environment has to be installed first.

@SnowyRanger it varies between tasks, but for the most part with image classification, I just store the labels in a JSON file and load it in when I load the weights from S3. For text, I set up the tokenizer as its own lambda, so it can be used by other models as well. Then I just save the itos and stoi for each field, encode the tokenized text when it returns from the lambda, pass it into the model, and decode the output back into text.
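A minimal sketch of the itos/stoi round trip, using only the standard library. The helper names, the `<unk>` token, and the choice of index 0 for unknown words are assumptions for illustration; a real vocabulary would come from whatever was saved at training time.

```python
import json

def build_vocab(tokens, unk="<unk>"):
    # itos: index -> string, with the unknown token at index 0 (assumed convention)
    itos = [unk] + sorted(set(tokens))
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def encode(tokens, stoi, unk_idx=0):
    # Words missing from the vocabulary fall back to the unknown index.
    return [stoi.get(tok, unk_idx) for tok in tokens]

def decode(ids, itos):
    return [itos[i] for i in ids]

itos, stoi = build_vocab("the cat sat on the mat".split())

# Only itos needs to be stored; stoi can be rebuilt from it at load time.
saved = json.dumps({"itos": itos})
loaded_itos = json.loads(saved)["itos"]
loaded_stoi = {tok: i for i, tok in enumerate(loaded_itos)}

ids = encode(["the", "dog", "sat"], loaded_stoi)
words = decode(ids, loaded_itos)
```

Saving just the list keeps the file small, and rebuilding the reverse mapping on load guarantees the two stay in sync.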

Basically, do whatever you have to do to get the data into an array, then convert it into a tensor, and pass it into your model. Get the data from the output tensor, and return the decoded response.

Also, the only packages you really need to have in production are PyTorch, dill, numpy, and if you’re doing images, whatever library you use for that. So your env setup shouldn’t have to be too crazy. I actually found it to be a really helpful exercise to go through and pull out only the necessary functions from fastai to make the model work.

That’s a good point about pulling only the necessary functions from fastai. In my mind there was an elegant approach where feature engineering could be done automatically after training the model, but no.

Take a look at this thread - Exposing DL models as api's/microservices

@daveluo has given a detailed account of deploying FastAI / PyTorch using a Flask app, and you may be able to use his CocoApp as a template for your own work.


Not too sure what kind of feature engineering you are talking about in production. If you had to preprocess the data before training on it, just copy those methods over to production and apply them when the request comes in.
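For the categorical case raised earlier in the thread, one possible way to avoid keeping the training table in memory is to save only the category-to-code mapping and apply it to each incoming record. This is a sketch, not fastai's implementation: the function names are invented, and the convention of 1-based codes with 0 for unseen values is an assumption that has to match however the model was trained.

```python
import json

def fit_categories(rows, columns):
    """Collect the sorted distinct values of each categorical column."""
    return {col: sorted({row[col] for row in rows}) for col in columns}

def apply_categories(row, categories):
    """Replace each categorical value with its 1-based code; 0 means unseen."""
    out = dict(row)
    for col, values in categories.items():
        lookup = {value: i + 1 for i, value in enumerate(values)}
        out[col] = lookup.get(row.get(col), 0)
    return out

# At training time: learn the mapping and save it next to the weights.
train = [{"store": "b", "day": "mon"}, {"store": "a", "day": "tue"}]
cats = fit_categories(train, ["store", "day"])
payload = json.dumps(cats)

# At request time: restore the mapping and encode a single new record.
restored = json.loads(payload)
encoded = apply_categories({"store": "a", "day": "sun", "sales": 3}, restored)
```

The mapping is usually a few kilobytes, so it can be loaded alongside the weights instead of the whole training table.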

I’d love to hear more details about this, and I think a blog post outlining the approach would be extremely popular. I tried lambda deployments of models before, but the model weights and required packages were over the 50 MB deployment package limit that they had, and it took a lot of work to pare it down. The S3 solution seems okay given that any service in production is going to have long uptimes.

My last model was on TF-serving, which is a whole other stack and wasn’t easy to convert/deploy. I’m working primarily in PyTorch now, and we don’t have many engineers with ML deployment experience, so I’d love to continue this conversation.

The pared-down version for production sounds like an interesting project/repository of its own. Do you have a base that you start from when deploying? And I’m assuming you add whole files and not just functions/classes?

I haven’t put together a base template yet. For the most part everything gets defined in the serverless.yml, so I just copy it from one project to another and change it. I’m not the best writer, but I’d be more than happy to put together a base template on GitHub; then, if someone is willing to help out with the technical writing, we could put together a blog post outlining the approach.

And to answer your last question, it’s usually whole files, unless I only need a function or two. Most of the time it ends up being 2-3 files.

I’d be up for collaborating on that. I’ve got a blog post out on the TensorFlow Serving side.

I’m curious about the CPU vs GPU aspect as well. I generally use CPU for inference, but for large batches and images the GPU definitely outperforms. I’m guessing Lambda limits you to CPU?

I haven’t used serverless before. I’ll have to spend some time looking into it.

Has anyone tried SageMaker? I’ve used it with a couple of models using sklearn, but I haven’t yet tried deploying something with fastai/pytorch.

Awesome! I’ll start putting it together this weekend.

I’m pretty sure Lambda doesn’t support GPU. In my experience though, I haven’t found the prediction times in production to be much of a bottleneck at all. It takes longer to load in the image, or to tokenize the text, than it does to actually get the predictions. Images usually take about 1-1.2s/request (depending on how you load in the image), and text is usually between 250-500ms/request. So say you sent 1000 requests, each responsible for returning a single prediction: you’ll get results faster than making 100 requests that return 10 predictions at a time. And because you pay per GB-second, the cost isn’t that much more.
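A back-of-the-envelope sketch of why splitting work into more requests costs little extra when billing is per GB-second. Every number here is an assumption for illustration only: the prices, the 1 GB memory setting, the 0.4 s per prediction, and the idea that a batch of 10 takes 10 times as long. Check current AWS pricing before relying on any of it.

```python
def lambda_cost(n_requests, seconds_per_request, memory_gb,
                price_per_gb_second, price_per_request):
    """Return (compute_cost, request_fee) for a set of invocations."""
    compute = n_requests * seconds_per_request * memory_gb * price_per_gb_second
    fee = n_requests * price_per_request
    return compute, fee

PRICE_PER_GB_SECOND = 0.0000166667  # assumed illustrative price
PRICE_PER_REQUEST = 0.0000002       # assumed illustrative price
MEMORY_GB = 1.0                     # assumed function memory setting

# 1000 requests of one prediction each, assuming 0.4 s per request.
single_compute, single_fee = lambda_cost(
    1000, 0.4, MEMORY_GB, PRICE_PER_GB_SECOND, PRICE_PER_REQUEST)

# 100 requests of ten predictions each, assuming time scales linearly to 4 s.
batched_compute, batched_fee = lambda_cost(
    100, 4.0, MEMORY_GB, PRICE_PER_GB_SECOND, PRICE_PER_REQUEST)

single_total = single_compute + single_fee
batched_total = batched_compute + batched_fee
```

Under these assumptions the compute charge is the same either way, since the total number of billed seconds is identical; only the small per-request fee differs, while the single-prediction requests can run in parallel and finish sooner.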

Hmmm, that’s not nearly as performant as I’d expect. We run the TensorFlow image classifier on GPU via tf-serving, with some preprocessing during input to normalize images, and our throughput is 249.6 req/s (4.0 ms/req) on a p2.xlarge. That’s for a 224x224 sized image. On a CPU mx.4xl we were at around 100 req/s.

I wonder if it’s worth looking at converting the model to Caffe2 via ONNX in order to try to improve performance.


Wow, that’s fast. Does that include the time to download each image? And is that server response time or client response time? I would definitely be interested in converting the model to increase performance.

Yeah, the image load is what takes most of the time. We’ve got an image api that we call to get the image in the size we want, but we do the normalization ourselves. The time to process a single image is 4ms. Time in the model is actually closer to 50us.

@Even Got it, yeah the image load is definitely the bottleneck. If you just need to preprocess the image and return a prediction, it shouldn’t take more than 25-50ms on the lambda. It’s definitely way cheaper to have lambdas running per request vs having p2s running 24x7, so for me at least, it’s worth the performance hit for the savings. But if we can convert the model and increase performance on the lambda, that would be incredible. The image classifier I have set up receives the images as they are stored in our database, which is 1280x1280, so requesting, downloading, and resizing the image takes over 90% of the request time. The server response time on the lambda is usually between 700-900ms. Also, just realized I wrote 1-2s/request before; I meant to write 1-1.2s/request.
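To check where the request time actually goes (download, resize, or prediction), a small stage timer can help. This is only a sketch: the `StageTimer` class is invented for this example and the `time.sleep` calls are stand-ins for the real work.

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Accumulate wall-clock seconds per named stage of a request."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def shares(self):
        # Fraction of the total measured time spent in each stage.
        total = sum(self.seconds.values())
        if not total:
            return {}
        return {name: secs / total for name, secs in self.seconds.items()}

timer = StageTimer()
with timer.stage("download"):
    time.sleep(0.05)    # stand-in for fetching and resizing the image
with timer.stage("predict"):
    time.sleep(0.001)   # stand-in for the forward pass

shares = timer.shares()
```

Returning `timer.seconds` in the response body, or logging it, makes it easy to see whether a model conversion would help or whether the image transfer dominates.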


I’ve heard that Caffe2 is meant to be the production deployment path for PyTorch. The link I posted earlier was also a lambda deployment, and I’m hoping to follow it using the same architecture we deployed for image classification, just to get a sense of performance differences.

I’d like to figure this out either way, and I think it’s worth putting together a blog post / some documentation around this topic. As the community matures, deployment of models is going to become a very important issue, and it’s one I’m facing now.

In terms of the lambda, when you say the response time is 700-900ms, does that include model load? I’m not sure how familiar you are with lambda, but there’s a way to do the model load on startup only, which, if that’s what you’re doing, should massively improve performance when running at scale.

I’ve tried it and wasn’t impressed. If you want to do the few things it does directly, like xgboost, and you don’t have engineering resources, then it’s beneficial. But doing anything outside their little sandbox essentially involves building your own Docker image and then paying Amazon a 50% premium on resources to run it via SageMaker.

If they develop ONNX support for it, it might become worthwhile, assuming they can make it performant, but right now I found it pretty frustrating to work with.

It seems like Nvidia’s TensorRT might be worth looking into as well in terms of inference of ONNX models in production, although that won’t work with serverless. But it seems like it’s designed for GPU inference at scale.

Yeah, if you look back at the initial post, I mentioned using a decorator to persist the model in memory between requests. To clear everything up, the 700-900ms is the time it takes the lambda to parse the URL from a query string, download the image from the URL (1280x1280, ~1.5 MB in my case), resize and preprocess the image input, return a prediction, and format the result into a JSON response.

Here’s how my handler is set up.

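import json
import traceback
import urllib.parse
import urllib.request

import boto3
import numpy as np
import torch
from PIL import Image

# Assumed to be defined elsewhere in the project: model (the network class),
# preprocess (PIL image -> normalized tensor), label_indices_dict,
# bucket_name and state_dict_key_name.
s3 = boto3.client('s3')  # assumed boto3 client; not shown in the original snippet

class SetupModel(object):
	# Class attributes live for as long as the lambda container stays warm.
	model = model()
	labels = list(label_indices_dict.values())
	def __init__(self, f):
		# Runs once per container, when the decorator is applied at import time.
		self.f = f
		file_path = f'/tmp/{state_dict_key_name}'
		s3.download_file(bucket_name, state_dict_key_name, file_path)
		state_dict = torch.load(file_path, map_location=lambda storage, loc: storage)
		SetupModel.model.load_state_dict(state_dict)
		SetupModel.model.eval()

	def __call__(self, *args, **kwargs):
		return self.f(*args, **kwargs)

def build_pred(label_idx, log, prob):
	label = SetupModel.labels[label_idx]
	return dict(label=label, log=float(log), prob=float(prob))

def predict(content):
	batch = []
	with Image.open(content) as im:
		im = im.convert('RGB')
		batch.append(preprocess(im))
	# torch.no_grad() replaces Variable(..., volatile=True) on PyTorch >= 0.4.
	with torch.no_grad():
		return SetupModel.model(torch.stack(batch, dim=0))

@SetupModel
def handler(event, _):
	try:
		img_qs = event['queryStringParameters'].get('image_url')
		img_url = urllib.parse.unquote_plus(img_qs)
		img = urllib.request.urlopen(img_url)
		out = predict(img)  # presumably log-probabilities, shape (1, n_classes)
		logs = out[-1].numpy()
		probs = np.exp(logs)

		n_results = int(event['queryStringParameters'].get('top_k', 0))
		n_results = min(max(n_results, 1), len(SetupModel.labels))
		# topk is a tensor method, so call it before converting to numpy.
		top_k = out.topk(n_results, sorted=True)[-1][0].tolist()
		preds = [build_pred(i, logs[i], probs[i]) for i in top_k]

		response_body = dict(predictions=preds)
		response = dict(statusCode=200, body=response_body)

	except Exception as e:
		response_body = dict(error=str(e), traceback=traceback.format_exc())
		response = dict(statusCode=500, body=response_body)

	response['body'] = json.dumps(response['body'])
	return response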

For anyone interested, I put together a base template showcasing this technique for a single-label image classifier. We will be using this as our jumping-off point for anyone who is interested in contributing or following along.