Production Case Scenario: C++ inference from Python-trained models

(Davide Boschetto) #1

Hi there!
I have had this problem for a while, and never thought about asking here… why not?

I need to use some models (convnets, VGG/ResNet style) that I trained in Python (with Keras, fastai, TensorFlow, or CNTK) for inference in a C++ production environment. The requirements are basically:

  • Speed (inference in milliseconds)
  • Small size of the library and its dependencies.

I have mainly found two approaches, one the opposite of the other:

  • Use the original C++ API of the respective framework (pro: speed, con: huge dependencies)
  • Rewrite the Conv and Dense layers from scratch (not so speedy…)

Had any of you somehow magically solved these issues?
Hoping for some feedback, thanks!

(Constantin) #2

You could go through TensorFlow Serving, which has a C++ API as well and scales to arbitrary numbers of users.
You’ll have to convert your model to TF’s protobuf format, along with some other steps, to make it work.
Check out these instructions by Siraj Raval:
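
As a rough illustration of the protobuf step, here is a minimal sketch of saving a model in TF’s SavedModel format, which is what TF Serving loads. The tiny `tf.Module` is just a stand-in for your trained network, and the `my_model/1` path is a hypothetical example (Serving expects a numeric version subdirectory):

```python
# Sketch: save a (placeholder) model in TensorFlow's SavedModel
# (protobuf) format so TensorFlow Serving can load it.
import tensorflow as tf

class TinyModel(tf.Module):
    """Stand-in for a real trained model."""

    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([8, 1]))

    @tf.function(input_signature=[tf.TensorSpec([None, 8], tf.float32)])
    def __call__(self, x):
        return tf.sigmoid(x @ self.w)

# TF Serving expects a numeric version subdirectory, e.g. my_model/1/
tf.saved_model.save(TinyModel(), "my_model/1")
```

The directory then contains a `saved_model.pb` protobuf plus variables, which is what you point `tensorflow_model_server` at.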

(Constantin) #3

For smaller projects you can also use Flask.
Siraj has a video on that as well.

I deployed a little sentiment classifier model like the one in lesson 5 of part 1 to test this:
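
For reference, a minimal Flask inference endpoint for that kind of small model might look like this; `predict_sentiment` here is a toy placeholder standing in for the real model call:

```python
# Minimal sketch of a Flask inference endpoint.
# predict_sentiment is a toy stand-in for a real model's predict call.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_sentiment(text):
    # Placeholder "model": score by a tiny positive-word list.
    positive = {"good", "great", "love"}
    score = sum(word in positive for word in text.lower().split())
    return "positive" if score > 0 else "negative"

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    return jsonify({"sentiment": predict_sentiment(text)})

if __name__ == "__main__":
    app.run(port=5000)
```

This is fine for prototypes and low traffic; for the millisecond-latency, high-throughput case in the original question you’d want something like TF Serving instead.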

(Tuatini GODARD) #4

It really depends on the framework you use.
If you use PyTorch, for instance, you may want to take a look at the ONNX format, which allows you to export your PyTorch model to a format readable by Caffe2. Caffe2 is production-ready and suited for IoT devices.
If you use TensorFlow/Keras, you may want to export your models to its default format and then read them with the TensorFlow C++ API on your IoT device/production environment.
Of course, as you mentioned, the con is having a huge dependency on the system libs. In my blog post I show how you can build TensorFlow with the C++ interface as a standalone app (which means a portable project with no dependencies on the system libs, except the basic ones that exist on all systems, like libc). Hope it helps :slight_smile:

(Matthijs) #5

Another solution is to replace the VGGNet-style network with something smaller and faster. Depthwise separable convolutions, such as those used in MobileNet, are way faster than the regular convolutions + fully-connected layers used in VGGNet. (Of course, you’ll have to retrain your model.)
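
To put rough numbers on that claim, here is the parameter count of a single 3×3 convolution layer versus its depthwise-separable equivalent. Pure arithmetic, no framework; the 256-channel layer size is just an illustrative example:

```python
# Parameter counts (ignoring biases) for one 3x3 conv layer with
# 256 input and 256 output channels: regular vs depthwise separable.
k, c_in, c_out = 3, 256, 256

regular = k * k * c_in * c_out   # standard convolution
depthwise = k * k * c_in         # one 3x3 filter per input channel
pointwise = c_in * c_out         # 1x1 conv mixing channels together
separable = depthwise + pointwise

print(regular, separable, regular / separable)
# 589824 vs 67840: roughly 8.7x fewer parameters
```

The multiply-add count shrinks by about the same factor, which is where most of the MobileNet speedup comes from.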

I do consulting work for deep learning on iOS, and using MobileNet-style architectures it is possible to run deep neural networks in real time (> 30 FPS) on iPhone 6 and up. I’m not sure what your environment is, or whether you have access to a GPU, but the choice of model architecture definitely makes a big impact on speed.

(Davide Boschetto) #6

Sorry for the late reply, it’s been a tough week!
Thanks for the insights, I’ll try something this week and get back to you all!