Production Case Scenario: C++ inference from Python-trained models

Hi there!
I have had this problem for a while, and never thought about asking here… why not?

I need to use some models (VGG/ResNet-style convnets) that I trained in Python (with Keras, fastai, TensorFlow, or CNTK) for inference in a C++ production environment. The requirements are basically:

  • Speed (inference in milliseconds)
  • Small size of the library and its dependencies.

I have found two main approaches, which are opposites of each other:

  • Use the C++ API of the framework each model was trained in (pro: speed, con: huge dependencies)
  • Rewrite the Conv and Dense layers from scratch (not so speedy…)

Has any of you somehow magically solved these issues?
Hoping for some feedback, thanks!

You could go through TensorFlow Serving, which has a C++ API as well and scales to an arbitrary number of users.
To make it work, you'll have to convert your model to TF's protobuf format, along with some other steps.
Check out these instructions by Siraj Raval:

For smaller projects you can also use Flask.
Siraj has a video on that as well.

I deployed a little sentiment classifier model like in lesson 5 part1 to test this:

It really depends on the framework you use.
If you use PyTorch, for instance, you may want to take a look at the ONNX format, which allows you to export your PyTorch model to a format readable by Caffe2. Caffe2 is production-ready and suited to IoT devices.
If you use TensorFlow/Keras, you may want to export your models to TensorFlow's default format and then read them with the TensorFlow C++ API on your IoT device or production environment.
Of course, as you mentioned, the con is the huge dependency on system libraries. In my blog post I show how you can build TensorFlow with the C++ interface as a standalone app (that is, a portable project with no dependencies on system libraries except the basic ones present on every system, like libc). Hope it helps :)


Another solution is to replace the VGGNet-style network with something that’s smaller and faster. Depthwise separable convolutions, such as those used in MobileNet, are way faster than the regular convolutions + fully-connected layers that are used in VGGNet. (Of course, you’ll have to retrain your model.)
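
To put a rough number on that, here is a small sketch that counts multiply-accumulate operations (MACs) for one regular convolution versus a depthwise convolution followed by a 1 x 1 pointwise convolution. The layer size (56 x 56 output, 256 input and output channels, 3 x 3 kernel) is a made-up example, not taken from any particular model, and MAC counts are only a proxy for real-world speed.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a regular k x k convolution producing an h x w output map."""
    return h * w * c_in * c_out * k * k


def separable_conv_macs(h, w, c_in, c_out, k):
    """MACs for a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k      # one k x k filter per input channel
    pointwise = h * w * c_in * c_out      # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 56 x 56 output, 256 -> 256 channels, 3 x 3 kernel.
standard = standard_conv_macs(56, 56, 256, 256, 3)
separable = separable_conv_macs(56, 56, 256, 256, 3)

print(f"standard:  {standard:,} MACs")               # 1,849,688,064
print(f"separable: {separable:,} MACs")              # 212,746,240
print(f"reduction: {standard / separable:.1f}x")     # 8.7x
```

The ratio separable / standard works out to 1/c_out + 1/k², so for 3 x 3 kernels the saving approaches 9x as the number of output channels grows.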

I do consulting work for deep learning on iOS, and with MobileNet-style architectures it is possible to run deep neural networks in real time (> 30 FPS) on iPhone 6 and up. I’m not sure what your environment is, or whether you have access to a GPU, but the choice of model architecture definitely has a big impact on speed.

Sorry for the late reply, it’s been a tough week!
Thanks for the insights, I’ll try something this week and get back to you all!

@DavideBoschetto Were you able to try? If so, please share your experience.
I have the same problem: training in one framework and running inference in another. Thanks in advance!

Hey, thanks for the ping.

Sadly, neither Caffe2 nor TensorFlow Serving is an option for me yet, so I've been quite stuck on the problem. I have since moved on to other projects and problems that are fully Python, and I suggest you do the same if you can!

Usually it’s possible to convert the weights from one framework to another (I do this all the time). Typically you just need to transpose the weights.
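
As a minimal illustration of the transposing step, take a fully-connected layer. Keras stores a Dense kernel as (inputs, outputs), while PyTorch's nn.Linear stores its weight as (outputs, inputs), so moving from one to the other is a plain matrix transpose. The sketch below uses plain Python lists and made-up weights to keep it dependency-free; the helper names are hypothetical. For convolutions the same idea applies with more axes: a Keras Conv2D kernel is (height, width, in, out) and a PyTorch Conv2d weight is (out, in, height, width).

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def dense_in_out(x, kernel):
    """y = x @ kernel, with kernel stored as (inputs, outputs)."""
    return [sum(x[i] * kernel[i][j] for i in range(len(x)))
            for j in range(len(kernel[0]))]


def dense_out_in(x, weight):
    """y = weight @ x, with weight stored as (outputs, inputs)."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weight]


# Hypothetical layer with 3 inputs and 2 outputs, in (inputs, outputs) layout.
keras_kernel = [[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]]

# The same weights in (outputs, inputs) layout.
torch_weight = transpose(keras_kernel)

x = [1.0, 0.5, -1.0]
print(dense_in_out(x, keras_kernel))   # [-2.5, -2.0]
print(dense_out_in(x, torch_weight))   # [-2.5, -2.0]
```

Comparing the outputs of both layouts on a few inputs like this is a cheap sanity check after any conversion.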

However, there are some gotchas. For example, different frameworks use different ways to pad the images and if you don’t adjust for this then you may get different / slightly worse results after converting the weights.
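
One concrete instance of the padding gotcha, sketched along a single spatial axis: TensorFlow's "SAME" mode computes the padding from the input size and puts any odd pixel at the end, whereas Caffe and PyTorch take an explicit padding that is applied equally on both sides. The function names and the 224-pixel input are assumptions made for illustration.

```python
import math


def same_padding(size, kernel, stride):
    """TensorFlow-style 'SAME' padding: returns (out_size, pad_before, pad_after)."""
    out = math.ceil(size / stride)
    total = max((out - 1) * stride + kernel - size, 0)
    before = total // 2                 # any odd leftover pixel goes at the end
    return out, before, total - before


def symmetric_padding(size, kernel, stride, pad):
    """Explicit symmetric padding: returns (out_size, pad_before, pad_after)."""
    out = (size + 2 * pad - kernel) // stride + 1
    return out, pad, pad


# Hypothetical first layer: 224-pixel input, 3 x 3 kernel, stride 2.
print(same_padding(224, 3, 2))          # (112, 0, 1)
print(symmetric_padding(224, 3, 2, 1))  # (112, 1, 1)
```

Both give a 112-pixel output, but the windows are shifted by one pixel relative to each other, so the converted network sees slightly different inputs at every position. Padding explicitly with (0, 1) before an unpadded convolution is one way to reproduce the "SAME" behaviour.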

Any update in 2019?