Exporting a model for local inference mode

I’m trying to implement a prototype of a gaze-controlled program. That is, the user interacts with the program by looking in certain directions or at certain spots. I (somewhat) successfully trained a model to recognize the direction in which the user is looking from a webcam shot.

However, how would I go about using such a model in a local app? More precisely, I have these two questions:

  1. Is it possible to run a fastai model on a Windows machine? I suspect the answer is no, but maybe someone has an idea. I know about learn.export(), but when loading such a learner on my Windows machine it throws the exception “cannot instantiate ‘PosixPath’ on your system”.

  2. Perhaps more importantly: is there a special “light-weight inference mode” I can put the model in? When I run learn.predict(img) on Paperspace it takes quite some time (over 1 second) to run the prediction on a single image. If my goal is real-time classification, how should I use the model?

Thanks in advance :slight_smile:
Oliver

In terms of lightweight inference, take a look here:

On CPU, I could bring single-image inference down to 732ms via the techniques mentioned.

1 Like

Wow, that was a quick answer.

Thank you for the link. 732ms is a lot higher than I imagined it would be, to be honest…
How can object detection models like YOLO run at a reasonable fps, yet a simple resnet barely manages 1fps? Or is that comparison flawed because those models work fundamentally differently?

I think it’s more the frameworks than the models, but that’s certainly a possibility too. Take EfficientNet: they’re hyper-optimized for classification specifically and not other tasks, so while they’re great at one particular thing, they may not be as efficient at others. As for how they get it to be so real-time, I’m unsure; I’d need to read more into it.

Have you made any progress on item 1 here? I’ve run fastai1 models on Windows before, so I think the answer is ‘yes’, but with version 2 I’m running into the same problem you describe. It seems similar to this v1 issue: https://github.com/fastai/fastai/issues/1482

If I understand correctly, the workaround is to use learn.save instead of learn.export. However, when loading the saved model with learn.load, you need to separately define the databunch (which translates to dls in version 2?). In version 1, there’s the ImageDataBunch.single_from_classes method to make a learner that’s ready to load a previously saved model. Does anyone know whether there’s an analogous strategy available in version 2?

@oneironaut I have a workaround for the issue in question 1:

  • After training the model, save it instead of exporting
    learn.save('model', pickle_protocol=4)
  • Copy the saved .pth file (written to the models/ folder by default) over to Windows
  • Recreate the dataloaders object. I did this using
    dls = ImageDataLoaders.from_folder(datapath, valid='valid', item_tfms=Resize(256), num_workers=0, bs=32)
    but I suspect there’s a less kludgey way to do this part
  • Recreate the learner with the same architecture and load the saved weights
    learn = cnn_learner(dls, resnet50)
    learn.load('model')
  • At this point, the model can be exported and loaded straightforwardly in Windows in the future
    learn.export('model_win', pickle_protocol=4)
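For reference, loading the re-exported model on the Windows side is then just a couple of lines (a minimal sketch; 'frame.png' is a placeholder for whatever image you pass in):

from fastai.vision.all import load_learner, PILImage

learn = load_learner('model_win')       # the file exported above
img = PILImage.create('frame.png')      # placeholder image path
pred_class, pred_idx, probs = learn.predict(img)
print(pred_class, probs[pred_idx])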
3 Likes

Wow, thank you for the detailed walkthrough! I’ll try it out over the weekend :slight_smile:

I have 170 - 300 ms prediction time* on a resnet50.

  • on Windows
  • no GPU, i5 CPU
  • I used export() / load_learner()

The key is using fastai2 in a Linux VM (I use WSL1, highly recommend), but using Windows to interface with your webcam. See the app here: https://github.com/sutt/fastai2-dev/tree/master/chess-classification-hw/app
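One simple way to wire something like that up is to have the Windows side dump webcam frames into a folder the VM can see, and have the WSL side poll it (a minimal sketch, not the actual app code; all paths, file names, and timings below are made up):

Windows side (plain Windows Python + OpenCV):

import time
import cv2

cap = cv2.VideoCapture(0)                 # default webcam
out_path = r'C:\shared\frame.jpg'         # hypothetical shared location

while True:
    ok, frame = cap.read()
    if ok:
        cv2.imwrite(out_path, frame)
    time.sleep(0.1)                       # ~10 captures per second

WSL side (runs the fastai2 learner):

import time
from fastai.vision.all import load_learner

learn = load_learner('export.pkl')        # hypothetical export name
frame_path = '/mnt/c/shared/frame.jpg'    # same file, seen from WSL

while True:
    try:
        pred, idx, probs = learn.predict(frame_path)
        print(pred, float(probs[idx]))
    except Exception:
        pass                              # the frame may be mid-write
    time.sleep(0.1)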

*:

1 Like

@sut you can probably speed it up even further by utilizing torch.jit (what was linked earlier :slight_smile: ), because I saw a speed increase of about 12% on CPU
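For reference, tracing the underlying PyTorch module only takes a few lines (a sketch; the 224x224 input size and file name are assumptions, and the preprocessing/normalization still has to happen before calling the traced module):

import torch

model = learn.model.eval().cpu()          # the plain PyTorch module inside the Learner
example = torch.randn(1, 3, 224, 224)     # dummy input with the training shape

with torch.no_grad():
    traced = torch.jit.trace(model, example)
traced.save('model_traced.pt')

# later: load the traced module and run a forward pass on a preprocessed tensor
loaded = torch.jit.load('model_traced.pt')
with torch.no_grad():
    logits = loaded(example)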

1 Like

Exporting to ONNX format and using ONNX Runtime, an open-source, cross-platform, efficient runtime for DNN inference, is another option to consider. It supports CPU and GPU inference. The runtime is C++ based but has language bindings for Python, C#, Java, etc., so it can integrate well with your application.

Here are some instructions on how to deploy an ONNX model to Azure and run predictions using the relatively lightweight ONNX Runtime. The main predict code is quite generic and can be used on any platform (Linux, Windows, Mac) for local inference. You don’t need PyTorch or the fast.ai library if you are using ONNX, so the deployment package is much smaller.

You can export your fast.ai/PyTorch model’s graph and weights to ONNX using the following snippet (in the bear detector example) after training.

import torch

# dummy input with the same shape as the training images
dummy_input = torch.randn(1, 3, 224, 224, device='cuda')
onnx_path = "./model.onnx"
torch.onnx.export(learn.model, dummy_input, onnx_path, verbose=False)
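For the inference side, a minimal sketch with ONNX Runtime looks like this (it assumes 224x224 inputs normalized with ImageNet stats, and 'frame.jpg' is a placeholder; adjust both to your training setup):

import numpy as np
import onnxruntime as ort
from PIL import Image

sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name

# resize, scale to [0, 1], normalize with ImageNet stats, reorder to NCHW float32
img = Image.open("frame.jpg").convert("RGB").resize((224, 224))
x = np.asarray(img, dtype=np.float32) / 255.0
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
x = ((x - mean) / std).transpose(2, 0, 1)[None]

logits = sess.run(None, {input_name: x})[0]
print(logits.argmax(axis=1))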

If you want to go further, you can also decrease the model’s memory footprint and complexity using model distillation from the Bag of Tricks paper: https://arxiv.org/abs/1812.01187

Disclaimer: I personally haven’t tested this to determine how well decreasing parameters decreases inference time.
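To make the distillation idea a bit more concrete, the usual setup is a small student network trained to match the softened outputs of the large teacher. A rough sketch of the combined loss (the temperature and weighting below are illustrative, not taken from the paper):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # soft part: KL divergence between softened student and teacher distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean',
    ) * (T * T)
    # hard part: ordinary cross-entropy against the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard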

I am not sure what your bottleneck is at the moment.

Putting the line torch.backends.cudnn.benchmark = True in your code tends to speed up inference (on a GPU) after you pass in a few examples as a warm-up.
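A minimal sketch of that warm-up (assuming a trained learn and a CUDA GPU; five passes is an arbitrary choice):

import torch

torch.backends.cudnn.benchmark = True     # let cuDNN pick the fastest kernels

model = learn.model.eval().cuda()
dummy = torch.randn(1, 3, 224, 224, device='cuda')

# a few throwaway forward passes so later predictions hit the tuned kernels
with torch.no_grad():
    for _ in range(5):
        model(dummy)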

So many great suggestions, you people are awesome!

It will take me quite some time to actually get to the point where I need this locally though. When I get to try it out I’ll let you know what worked :slight_smile:

To do real-time predictions you need a model that is small enough so that it runs fast enough on your hardware.

The primary indicator of how fast a model will be is the number of memory accesses it does (not the number of FLOPS, although the two are often correlated). On mobile hardware, for example, a model with a ResNet backbone is going to be very slow.

It also helps to have everything “warmed up” already, i.e. don’t call a predict.py script that loads the model for every single request. You need to have this constantly running as a process on your machine. New requests go into a queue. The inference process reads from this queue and sends the results back when they are ready.

You can’t really use batches for real-time use, which means this will run less efficiently than during training. But you can often run multiple requests at once, i.e. already prepare the next request while the current one is still being processed by the GPU.
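A bare-bones version of that pattern with the standard library (a sketch; some_frame stands in for a captured webcam frame, and predict_fn wraps the already-loaded learner from earlier in the thread):

import queue
import threading

requests = queue.Queue()

def predict_fn(frame):
    # wraps the learner that was loaded once at startup
    return learn.predict(frame)

def inference_worker(predict_fn):
    # the worker keeps pulling frames and pushing results back
    while True:
        frame, result_q = requests.get()
        result_q.put(predict_fn(frame))

threading.Thread(target=inference_worker, args=(predict_fn,), daemon=True).start()

# caller side: enqueue a frame together with a queue for its result
result_q = queue.Queue()
requests.put((some_frame, result_q))
prediction = result_q.get()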

1 Like

Some quick ideas:
• Experiment with CPU/GPU latency differences, although a GPU won’t speed you up by much when running one image at a time.
• Make sure your code isn’t loading the model multiple times. Load the model once when your program starts.

Some more involved ideas:

Since latency is a key issue here, you should profile your model and see where the latency is coming from. You should also determine the FPS your model needs to run at to be useful.

When you call learn.predict, basically two things happen. First there’s the preparation of the image. It’s converted to a tensor, resized and packaged into a DataBlock object. Then you run the forward pass of your model.

Most likely the model forward pass is the rate-limiting step. Here are the things I would try:

  1. Use smaller images for inference. Running inference on a 256x256 image is slower than resizing the image to 128x128 and running inference at that scale. Determine how large the images need to be for your model to reach your target performance.
  2. Use a smaller model. Maybe a resnet18 works as well as a resnet34
  3. Convert your model to a compiled format. Use something like TorchScript or ONNX to convert your model from Python to C++.
  4. Model downsizing potpourri. Look into techniques like weight pruning, weight quantization and model distillation

There’s also a scenario where you’re getting some overhead from fastai. As I understand it, when you call learn.predict, the image is passed into a DataBlock object, so you have to create all these fastai objects on every single prediction. All you actually need to do is resize the image, convert it to a PyTorch tensor, and normalize using whatever stats you used for training. You could write your own function to do that, then load your model as a pure PyTorch model and remove fastai from the process.
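Something along these lines (a sketch; it assumes a 224x224 training size and ImageNet normalization stats, so swap in whatever you actually trained with):

import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                       # HWC image -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = learn.model.eval().cpu()                 # the plain PyTorch module inside the Learner

def predict(img: Image.Image):
    x = preprocess(img).unsqueeze(0)             # add the batch dimension
    with torch.no_grad():
        logits = model(x)
    return logits.softmax(dim=1)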

3 Likes

100% right on the overhead. If you narrow down the transforms and just apply the ones the model would expect during validation, you can usually save time there too (as they’re applied on the fly).

I proved this here in my most recent find. With a GPU I was able to get a resnet18 to be close to real time (~40ms), and this can very easily be modified for CPU as well.

(For some other numbers: a resnet18 on CPU with a single image using this method got me 74.7ms.)

To squeeze even more speed out of your model you can also remove certain layers completely, such as batchnorm. If the order is conv -> BN -> ReLU then you can remove batchnorm by combining its weights with the conv layer. (I plan to write a blog post about this soon.) It’s only a small gain but if you have many BN layers, it still adds up.
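For anyone curious, the folding itself is short. A rough sketch for a single conv -> BN pair (it assumes the conv has no bias of its own, which is typical when a BN layer follows):

import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # per-output-channel scale: gamma / sqrt(running_var + eps)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)

    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    with torch.no_grad():
        # fold the scale into the conv weights and absorb BN's shift into a bias
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        fused.bias.copy_(bn.bias - bn.running_mean * scale)
    return fused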

1 Like

Hi @tank13 @oneironaut,
I ran into the same issue and did it the way you suggested: saved the model, recreated dls and learn, loaded the saved model, and then exported with learn.export('model.pkl').

The error (“cannot instantiate ‘PosixPath’ on your system”) is because we are running the server on a Windows system, so we have to use PureWindowsPath() as mueller suggested:

path = PureWindowsPath('./artifacts')
model = load_learner(path/'learn.pkl')

And this throws an error: “PureWindowsPath has no attribute ‘seek’. You can only torch.load from a file that is seekable.”

How did you overcome this error? Kindly help me out.

Thank you,

Did you try without specifying the path as a PureWindowsPath? Just using a regular Path, I did not encounter the error you describe.

I believe the “cannot instantiate ‘PosixPath’ on your system” error arises because learn.export includes information about the model’s path, so if it was exported on Unix then the relevant path is a PosixPath, which is incompatible with Windows. However, if the model you’re loading was created on Windows, then this isn’t an issue, and you don’t need to do anything special for the path to work.

Hope this helps :slight_smile:

1 Like

Hi @tank13,
Thanks for the reply.

Ohh… so saving and loading the model should erase that path information. I did exactly that, like below…
learn.save('model')
# recreated the dataloaders and learner …
learn.load('model')
learn.export('model_win.pkl')

But I still get that ‘cannot instantiate PosixPath’ error. Here’s the server-side code…
path=Path('artifacts')
__model=load_learner(path/'model_win.pkl')

What could have gone wrong…? :confused:

I trained a multi-category image classifier on Colab and exported it. I am trying to do inference on a standalone Windows machine. Loading the learner with
inf_learner = load_learner('model.pkl')
gives the following error:

cannot instantiate ‘PosixPath’ on your system

I used the same technique with fastai1 and it used to work fine.
Please let me know what detail I am missing.

Regards

1 Like