I am currently working on a small toy-project that involves object detection as one of the steps.
Currently, I am using a pre-trained Faster R-CNN from Detectron2 with a ResNet-101 backbone.
I wanted to make an MVP and show it to my colleagues, so I thought of deploying my model on a CPU machine. Detectron2 models can be easily converted to Caffe2 (see the docs) for deployment.
I measured the inference time in both GPU and CPU mode. The inference time of the original Detectron2 model using PyTorch on the GPU is around 90 ms on my RTX 2080 Ti.
The converted model running on the CPU (i9-9940X) through the Caffe2 API took 2.4 s. I read that Caffe2 is optimized for CPU inference, so I am quite surprised by the inference time on the CPU. I asked about this on the Detectron2 GitHub and got an answer like: "Expected inference time of R-50-FPN Faster R-CNN on an 8-core CPU is around 1.9 s. Usually, ResNets are not used on CPUs."
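To make numbers like the ones above comparable, it helps to time both setups the same way. Below is a minimal sketch of a latency harness using only the standard library. The names `measure_latency` and `dummy_inference` are made up for illustration; in practice `dummy_inference` would be replaced by the real model call.

```python
import statistics
import time


def measure_latency(fn, warmup=3, runs=10):
    """Return (median, min, max) wall-clock seconds for calling fn()."""
    # Warm-up calls are discarded: the first runs often include one-time
    # costs such as lazy initialization or memory allocation.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), min(samples), max(samples)


def dummy_inference():
    # Hypothetical stand-in for a real model call on one image.
    return sum(i * i for i in range(50_000))


median_s, min_s, max_s = measure_latency(dummy_inference)
print(f"median {median_s * 1e3:.2f} ms "
      f"(min {min_s * 1e3:.2f} ms, max {max_s * 1e3:.2f} ms)")
```

When timing a PyTorch model on a GPU, the function being timed should end with `torch.cuda.synchronize()`, because CUDA kernels are launched asynchronously and the clock would otherwise stop before the work is finished.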
Here is my question: how should such deep learning solutions for computer vision be deployed in the real world? I read somewhere that Facebook uses Caffe2 for its production models because CPUs are much cheaper than GPUs (of course they are), but the difference in running time is really huge. Using a CPU for object detection seems useless for any real-time application.
Should I use some other architecture, which does not include ResNet or Faster R-CNN (like YOLO v4/v3, SSD, etc.)? Or maybe the original GPU-trained model should be converted to ONNX and then run in another, more CPU-optimized framework such as OpenVINO? Or are there other tweaks, such as quantization or pruning, that are necessary to boost the CPU inference efficiency of production models?
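For readers unfamiliar with the quantization mentioned above, here is a minimal sketch of the arithmetic behind 8-bit affine quantization, written in plain Python. It is only an illustration of the idea, not the API of PyTorch, OpenVINO, or any other framework; the function names and the example weights are made up.

```python
def quantize_int8(values):
    """Affine-quantize floats to int8; returns (ints, scale, zero_point)."""
    # The representable range must contain 0 so that zero maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    # Round to the nearest step, shift by the zero point, clamp to int8.
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.2, -0.3, 0.0, 0.45, 2.1]  # made-up example weights
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.5f}", f"max error={max_error:.5f}")
```

Each value is stored in one byte instead of four, and the reconstruction error is bounded by roughly one quantization step. Whether this actually speeds up inference depends on the framework having int8 kernels for the target CPU.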
I know that this is just a toy project (for now at least), and I can use a GPU for inference (quite costly in real applications?) or just use another architecture (but sacrifice accuracy). I am just wondering what the go-to solution is for real-world systems.
Thanks in advance for sharing your knowledge and experience. I will be grateful for any hints!
Should I use some other architecture, which does not include ResNet or Faster-RCNN (like YOLO v4/v3, SSD, etc.)?
Yes. You might want to search for neural networks designed for cell phones. ResNet was designed to get maximum accuracy on ImageNet, not necessarily fast inference on a CPU. There is generally a tradeoff between accuracy and inference time. I recommend trying out other architectures. For example, use a pretrained MobileNet (PyTorch already provides one). If you find something fast enough for cell phone applications, it is probably fast on CPUs too. Here is the paper on MobileNetV3: https://arxiv.org/abs/1905.02244 .
For object detection, I think you might be confusing two parts. There is a feature extraction backbone (ResNet-50, MobileNet, etc.), and then there is the object detection algorithm (Faster R-CNN, YOLO, SSD). People mix and match these two, as they are generally independent.
I believe I am not confusing the two parts of object detection. I am aware that most of these solutions are two-stage. I must admit, I have already discarded all MobileNet-like backbones because their accuracy is insufficient compared to 'full-fledged' backbones.
Of course one can just change the architecture or algorithm, and that is fine, but I am more interested in how real-world production models are deployed. I am wondering whether solutions that include object detection are deployed on GPU servers, or whether they are run on CPU machines (for cost efficiency) with appropriate optimizations and tweaks.
I doubt that, say, Facebook's object detection solutions run with a MobileNet backbone (I may be wrong!), so they either use GPU servers to run accurate models or deploy on CPUs in a highly optimized regime.
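The cost side of this tradeoff can be put into numbers. The sketch below uses the two latencies reported earlier in this thread (90 ms on the GPU, 2.4 s on the CPU) and assumes one image is processed at a time. The hourly prices are purely hypothetical placeholders, not real quotes.

```python
# Latencies reported earlier in this thread.
gpu_latency_s = 0.090  # Detectron2 + PyTorch on an RTX 2080 Ti
cpu_latency_s = 2.4    # Caffe2 export on an i9-9940X


def throughput(latency_s):
    """Images per second when images are processed one at a time."""
    return 1.0 / latency_s


def cost_per_1000_images(latency_s, price_per_hour):
    """Cost of 1000 sequential inferences on a machine billed per hour."""
    return 1000 * latency_s / 3600 * price_per_hour


slowdown = cpu_latency_s / gpu_latency_s
print(f"GPU: {throughput(gpu_latency_s):.1f} img/s, "
      f"CPU: {throughput(cpu_latency_s):.2f} img/s")
print(f"The CPU is {slowdown:.1f}x slower, so it has to be more than "
      f"{slowdown:.1f}x cheaper per hour to win on cost per image.")

# Hypothetical hourly prices, for illustration only.
gpu_price_per_hour = 1.00
cpu_price_per_hour = 0.10
print(f"GPU: {cost_per_1000_images(gpu_latency_s, gpu_price_per_hour):.4f} "
      f"per 1000 images")
print(f"CPU: {cost_per_1000_images(cpu_latency_s, cpu_price_per_hour):.4f} "
      f"per 1000 images")
```

With these made-up prices, a CPU machine that is ten times cheaper per hour still costs more per image, because the slowdown is larger than the price gap. Batching, multiple CPU workers, and optimized runtimes would all change the picture.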
MobileNet architectures have lower accuracy. You could use one-stage detectors with regular backbones. You could try CenterNet V2. It uses Detectron2, and the authors have released lite architectures.
Please, I would like some guidelines. I am trying to understand how I can perform object extraction from satellite imagery, but I don't know where to start. In particular, I would like to extract specific information from the objects detected in the imagery. I will be very grateful.