Help with lightweight object detection neural network

Hi all

I’m working on a very lightweight object detector. It needs to run as fast as possible on a relatively slow platform (a Raspberry Pi in my case). The goal is to recognize only one type of object, say humans, and return the location of the object in the picture.

I have a couple of questions I’d like to hear your thoughts on before I go and collect data, in case your answers affect my next steps:

  1. Since I don’t need to draw a box around the object (just find its location), what do you think of simply having the network return an (x, y) location as two outputs, with a third output giving the confidence that the object is in the frame at all (i.e. outputs of [confidence] [x] [y])? This is similar to the first lesson of fastai2.
  2. While there will mostly be only one object in the image at a time, I want to properly handle cases where there are multiple objects. In the lesson, Jeremy found that the network would point at the “average” of the (large) objects in the picture. However, I hope that if I label the data correctly I can make it indicate only the largest object (likely the closest one). Your thoughts on this? Another option would be to use some sort of weighted average of the objects in the frame as the ground truth.
  3. Does anyone have any ‘tips’ for neural network architectures which fit the requirement of ‘lightweight, localization, pretrained enough that I need relatively little data’?
  4. Finally, if someone knows of a dataset of localized humans (or dogs, cats, or any other common thing I can test on outside of simulation) taken from a CCTV-style viewpoint (i.e. from a slightly raised viewpoint, where the object takes up only a relatively small portion of the frame), I would greatly appreciate it! It’s not my actual application, but it would make prototyping and answering my questions myself much easier!
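
To make questions 1 and 2 concrete, here is a minimal sketch of how the [confidence] [x] [y] target and a matching loss might look. It assumes the labels are available as pixel bounding boxes and that "largest object" means largest box area. The names `make_target` and `localization_loss` are hypothetical, and the loss (binary cross-entropy on confidence plus squared error on the location, ignored when no object is present) is just one reasonable choice, not something taken from the lesson.

```python
import math

def make_target(boxes, img_w, img_h):
    """Build a [confidence, x, y] target from pixel boxes (x_min, y_min, x_max, y_max).

    Picks the box with the largest area and returns its centre,
    normalised to the 0..1 range in each dimension.
    """
    if not boxes:
        # No object in the frame: x and y are placeholders the loss ignores.
        return [0.0, 0.0, 0.0]
    x0, y0, x1, y1 = max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
    return [1.0, (x0 + x1) / 2 / img_w, (y0 + y1) / 2 / img_h]

def localization_loss(pred, target, eps=1e-7):
    """pred and target are [confidence, x, y]; pred confidence is a probability."""
    p = min(max(pred[0], eps), 1 - eps)  # clamp to avoid log(0)
    t = target[0]
    bce = -(t * math.log(p) + (1 - t) * math.log(1 - p))
    # Location error only counts when an object is actually present (t == 1).
    loc = t * ((pred[1] - target[1]) ** 2 + (pred[2] - target[2]) ** 2)
    return bce + loc

# Two people in a 400x400 frame: the target points at the larger box.
target = make_target([(0, 0, 10, 10), (100, 100, 300, 300)], 400, 400)
print(target)
```

Masking the location term matters here: without it, empty frames would push the (x, y) outputs toward whatever placeholder value the labels use.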

Thank you in advance! I’ll try to answer these later on if there isn’t much response

I should also mention that, intuitively, I think that running the neural network over portions of the frame multiple times would be a bit too compute intensive. However, if it makes the neural network significantly simpler, it could also be an option (instead of having location as an output). It would also solve the issue of multiple objects in the frame. Any thoughts on this? (personal experiences welcome)
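
A rough way to put a number on that intuition is to count how many crops a sliding window produces per frame. The sketch below assumes a 640x480 frame, 160x160 square windows and 50% overlap; these sizes and the helper name `window_origins` are made up for illustration.

```python
def window_origins(frame_w, frame_h, win, stride):
    """Top-left corners of square win x win crops, stepping by stride pixels."""
    xs = range(0, frame_w - win + 1, stride)
    ys = range(0, frame_h - win + 1, stride)
    return [(x, y) for y in ys for x in xs]

frame_w, frame_h, win, stride = 640, 480, 160, 80  # assumed sizes
origins = window_origins(frame_w, frame_h, win, stride)

# Total pixels pushed through the network, relative to one full-frame pass.
pixel_ratio = len(origins) * win * win / (frame_w * frame_h)
print(len(origins), round(pixel_ratio, 2))
```

With these numbers that is 35 forward passes per frame, covering roughly 2.9 times as many pixels as a single full-frame pass, so the per-crop network would have to be a lot cheaper for this to pay off.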

MobileNetV2 is a pretty decent neural network, I use it on iOS a lot (with GPU acceleration). Combined with SSDLite it’s a good object detector.
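
As a rough illustration of why this combination is cheap: MobileNet-style networks are built from depthwise separable convolutions, and SSDLite uses them in the SSD prediction layers as well. The sketch below compares multiply-accumulate counts for one standard 3x3 convolution against its separable equivalent. The layer size (20x20 feature map, 256 channels in and out) is an assumed example, not a layer from the actual model.

```python
def standard_conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulates for a regular k x k convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k

def separable_conv_macs(h, w, c_in, c_out, k=3):
    """Depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Assumed example layer: 20x20 feature map, 256 -> 256 channels.
standard = standard_conv_macs(20, 20, 256, 256)
separable = separable_conv_macs(20, 20, 256, 256)
ratio = separable / standard  # equals 1/c_out + 1/k**2
print(standard, separable, round(ratio, 3))
```

For a 3x3 kernel the separable version needs a little over a ninth of the work, which is where most of the speedup on slow hardware comes from.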

I wrote a blog post about how it works:

Perhaps it’s worth getting one of those new Edge TPU thingies that was just announced?

Fantastic article - thank you. Really nice blog in general. I think I’ll go with your advice and try that network

The Edge TPU looks interesting. I actually already have another plug-in accelerator (a Movidius Neural Compute Stick) from someone else, which should help speed things up.

Hi all, I wanted to check whether MobileNet + SSD is still relevant today, or is another architecture now recommended for a similar mobile app (lightweight object detection)?