Hi all
I’m working on a very lightweight image classifier. It will need to run “as fast as possible” on a relatively slow platform (a Raspberry Pi, in my case). The goal is to recognize only one type of object - say, humans - and return the location of that object in the picture.
I have a couple of questions I’d like to hear your thoughts on before I go and collect data etc., in case your answers affect my next steps. My questions are:
- Seeing as I don’t need to draw a box around the object (just find its location), does anyone have any thoughts on simply returning an (x, y) location as two outputs of the network, with a third output being the confidence that the object is even in the frame? (Similar to the first lesson of fastai2 - i.e. outputs of [confidence, x, y].)
- While there will mostly be only one object in the image at a time, I want to properly handle cases where there are multiple objects. In the lesson, Jeremy found that the network would point at the “average” of (large) objects in the picture; however, I hope that if I label the data correctly, I can make it indicate only the largest object (likely the closest one). What are your thoughts on this? Another option would be to use some sort of weighted average of the objects in the frame as the ground truth.
- Does anyone have any tips for neural network architectures that fit the requirements of ‘lightweight, does localization, and pretrained enough that I need relatively little data’?
- Finally, if anyone knows of a dataset of localized humans (or dogs/cats/another common object I can test on outside of simulation) taken from a CCTV-style viewpoint (i.e. from a slightly raised camera, where the object takes up only a relatively small portion of the frame), I would greatly appreciate it! It’s not my actual application, but it would make prototyping - and answering my own questions - much easier!
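To make the first question concrete, here’s a rough sketch of what I have in mind for the [confidence, x, y] head - a pretrained backbone feeding a 3-unit linear layer, with sigmoids so confidence and the (x, y) location are all in [0, 1]. The backbone here is just a stand-in (names and sizes are made up for illustration); in practice it would be the body of a small pretrained CNN:

```python
import torch
import torch.nn as nn

class SingleObjectLocator(nn.Module):
    """Sketch: predict [confidence, x, y] for a single object class."""

    def __init__(self, backbone_out=64):
        super().__init__()
        # Stand-in backbone for illustration; a real version would use a
        # small pretrained network (e.g. a MobileNet/ResNet body).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, backbone_out, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(backbone_out, 3)

    def forward(self, x):
        out = self.head(self.backbone(x))
        conf = torch.sigmoid(out[:, 0])   # confidence the object is in frame
        xy = torch.sigmoid(out[:, 1:])    # (x, y) normalized to [0, 1]
        return conf, xy

model = SingleObjectLocator()
conf, xy = model(torch.randn(2, 3, 64, 64))  # batch of 2 dummy images
```

The confidence output could be trained with binary cross-entropy and the (x, y) output with L1/MSE loss, masked so the location loss only applies to frames that actually contain the object - but that detail is exactly what I’m hoping for feedback on.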
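And for the second question, the two labeling options I’m comparing would look something like this (plain Python, boxes as (x0, y0, x1, y1) tuples - purely illustrative):

```python
def box_area(b):
    x0, y0, x1, y1 = b
    return max(0, x1 - x0) * max(0, y1 - y0)

def box_centre(b):
    x0, y0, x1, y1 = b
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def target_largest(boxes):
    # Option 1: label only the largest (likely closest) object.
    return box_centre(max(boxes, key=box_area))

def target_weighted(boxes):
    # Option 2: area-weighted average of the object centres.
    total = sum(box_area(b) for b in boxes)
    cx = sum(box_centre(b)[0] * box_area(b) for b in boxes) / total
    cy = sum(box_centre(b)[1] * box_area(b) for b in boxes) / total
    return (cx, cy)

boxes = [(0, 0, 2, 2), (4, 4, 10, 10)]  # areas 4 and 36
print(target_largest(boxes))    # → (7.0, 7.0)
print(target_weighted(boxes))   # → (6.4, 6.4)
```

My worry with the weighted average is that it can point at empty space between objects, whereas the largest-object label at least always lands on an object.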
Thank you in advance! I’ll try to answer these myself later on if there isn’t much response.