Multiple classes and bounding boxes in a single image

Hi all,

I’ve been working through Lesson 7 and have a pretty good understanding of using annotations (specifically bounding boxes) to train the network and to predict a bounding box when a single instance of a category/class appears in an image.

However, I would like to extend this so that I can train the network to look for:

  1. multiple categories/classes in a single image (for example, two or more different types of fish, although not necessarily fish); and
  2. their respective bounding boxes

so that the final prediction outputs, for example, 2, 3, or 5 bounding boxes along with the class of each object identified in the image. Any given image may contain zero or more different categories/classes.
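One common way to handle "zero or more objects per image" when a network or loss expects fixed-size tensors is to pad each image's annotation list up to a maximum count and keep a mask of which slots are real. This is a minimal sketch in plain Python; `pad_annotations`, `max_objects`, the padding class `-1`, and the `(x_min, y_min, x_max, y_max)` box format are hypothetical choices for illustration, not taken from the course notebooks or any library.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]

def pad_annotations(annotations: List[Tuple[int, Box]], max_objects: int,
                    pad_class: int = -1):
    """Pad a variable-length list of (class_id, box) pairs to a fixed length.

    Returns fixed-size class and box lists plus a mask that is 1 for real
    objects and 0 for padding, so a loss can ignore the padded slots.
    """
    if len(annotations) > max_objects:
        raise ValueError("image has more objects than max_objects")
    classes = [class_id for class_id, _ in annotations]
    boxes = [tuple(box) for _, box in annotations]
    mask = [1] * len(annotations)
    # Fill the remaining slots with a dummy class and an all-zero box.
    n_pad = max_objects - len(annotations)
    classes += [pad_class] * n_pad
    boxes += [(0.0, 0.0, 0.0, 0.0)] * n_pad
    mask += [0] * n_pad
    return classes, boxes, mask

# An image with two objects of different classes, padded to four slots.
classes, boxes, mask = pad_annotations(
    [(2, (10, 20, 50, 80)), (0, (60, 5, 90, 40))], max_objects=4)
print(classes, mask)  # [2, 0, -1, -1] [1, 1, 0, 0]
```

Padding only handles the data side; deciding which network output should be compared with which target is what grid cells (YOLO) or default/anchor boxes (SSD, Faster R-CNN) are for.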

I’m struggling a little to tie it all together, in particular how to build the architecture so that a single image can contain an arbitrary number of classes and bounding boxes, and wondered if anyone had any pointers in the right direction.
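One idea that may help tie it together: detectors such as YOLO, SSD and Faster R-CNN generally don't emit a variable-length list directly. The network outputs a fixed, large set of candidate boxes, each with class scores, and a post-processing step keeps the confident candidates and removes near-duplicates with non-maximum suppression (NMS), so the number of final boxes varies per image. Below is a minimal plain-Python sketch of that post-processing step; `iou`, `postprocess`, and the 0.5 thresholds are assumptions for illustration, and real implementations (usually tensor-based) differ in detail.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def postprocess(boxes: List[Box], scores: List[float], classes: List[int],
                score_thresh: float = 0.5, iou_thresh: float = 0.5):
    """Drop low-confidence candidates, then run greedy per-class NMS so that
    overlapping duplicates of one object collapse to a single box."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep i unless a higher-scoring kept box of the same class overlaps it.
        if all(classes[j] != classes[i] or iou(boxes[i], boxes[j]) < iou_thresh
               for j in kept):
            kept.append(i)
    return [(classes[i], scores[i], boxes[i]) for i in kept]

# Four fixed candidates in, a variable number of detections out.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.3]
classes = [0, 0, 1, 0]
print(postprocess(boxes, scores, classes))
# [(0, 0.9, (0, 0, 10, 10)), (1, 0.7, (20, 20, 30, 30))]
```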

Finally, I’ve been attempting to read and understand some papers (for example, on Faster R-CNN and segmentation) that seem to do this, so perhaps this is a way forward. Not being a math expert, it’s taking me a while to get my head around it all and to see how it could be coded in Keras!

Thanks for any pointers/discussion,




Hi Paul,

I’ve been away from machine learning for a couple of months, so my recollection may be a bit iffy, but here is a link that should get you started in the right direction: YOLO Real-Time Object Detection.

I’m sure more knowledgeable folks will chime in, but I’m sharing my 2 cents in case they help.

And here is another link for a paper on Single Shot Multibox Detectors.
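During training, SSD has to decide which of its many fixed default boxes should predict which ground-truth object. As I understand the paper's matching strategy, each ground-truth box is matched to the default box it overlaps most (by IoU, also called Jaccard overlap), and any other default box whose IoU with some ground truth exceeds 0.5 is matched too; the rest are treated as background. The sketch below assumes that reading and shows only the class-label assignment; `match_default_boxes` and `background=0` are hypothetical, and the real method also regresses box offsets for the matched defaults.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union (Jaccard overlap) of two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_default_boxes(defaults: List[Box], gt_boxes: List[Box],
                        gt_classes: List[int], iou_thresh: float = 0.5,
                        background: int = 0):
    """Give every default box a training label: the class of the ground-truth
    box it is matched to, or `background` if it matches nothing."""
    labels = [background] * len(defaults)
    matched = [-1] * len(defaults)  # index of matched ground truth, -1 = none
    if not defaults or not gt_boxes:
        return labels, matched
    # Step 1: each ground-truth box claims its single best-overlapping default.
    for g, gt in enumerate(gt_boxes):
        best = max(range(len(defaults)), key=lambda d: iou(defaults[d], gt))
        labels[best], matched[best] = gt_classes[g], g
    # Step 2: any other default above the IoU threshold is also a positive.
    for d, box in enumerate(defaults):
        if matched[d] != -1:
            continue
        overlaps = [iou(box, gt) for gt in gt_boxes]
        g = max(range(len(gt_boxes)), key=overlaps.__getitem__)
        if overlaps[g] > iou_thresh:
            labels[d], matched[d] = gt_classes[g], g
    return labels, matched

defaults = [(0, 0, 10, 10), (10, 0, 20, 10), (0, 10, 10, 20),
            (10, 10, 20, 20), (0, 0, 11, 11)]
print(match_default_boxes(defaults, [(1, 1, 11, 11)], [3]))
# ([3, 0, 0, 0, 3], [0, -1, -1, -1, 0])
```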

Once you have a starting point, it’s relatively easy to start googling around ;).

All the best,


I would also recommend having a look at YOLO. I’ve studied various architectures, and YOLO was the easiest to understand. I built an object detection system in TensorFlow, trying to keep the code clean and simple; maybe it could help you get started [1].
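To make the YOLO idea concrete: YOLO (v1) splits the image into an S×S grid, and the cell containing an object's centre is responsible for predicting that object. The following is a rough sketch of building that kind of training target for one image, assuming a single box per cell; the original paper predicts several boxes per cell and regresses the square roots of width and height, which this sketch skips. `encode_grid_targets` and its defaults (a 7×7 grid, 20 classes) are illustrative, not taken from the code linked above.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

def encode_grid_targets(annotations: List[Tuple[int, Box]], img_w: float,
                        img_h: float, grid: int = 7, num_classes: int = 20):
    """Build a grid x grid target where each cell holds
    [objectness, x, y, w, h, one-hot class scores...]."""
    depth = 5 + num_classes
    target = [[[0.0] * depth for _ in range(grid)] for _ in range(grid)]
    for class_id, (x0, y0, x1, y1) in annotations:
        # Object centre, normalised to [0, 1] by the image size.
        cx = (x0 + x1) / 2 / img_w
        cy = (y0 + y1) / 2 / img_h
        # The cell containing the centre is "responsible" for this object.
        col = min(int(cx * grid), grid - 1)
        row = min(int(cy * grid), grid - 1)
        cell = target[row][col]
        if cell[0] == 1.0:
            continue  # simplification: keep only the first object per cell
        cell[0] = 1.0
        cell[1] = cx * grid - col  # centre offset inside the cell
        cell[2] = cy * grid - row
        cell[3] = (x1 - x0) / img_w  # width/height relative to the image
        cell[4] = (y1 - y0) / img_h
        cell[5 + class_id] = 1.0
    return target

# Two objects of different classes in a 448 x 448 image, 3 classes.
t = encode_grid_targets([(2, (0, 0, 100, 100)), (1, (300, 300, 400, 420))],
                        img_w=448, img_h=448, grid=7, num_classes=3)
print(sum(cell[0] for row in t for cell in row))  # 2.0
```

The network's last layer would then be shaped to output the same grid × grid × (5 + num_classes) tensor, and the loss compares prediction and target cell by cell, which is how a fixed-size output can describe a varying number of objects.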

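I would also recommend having a look at “Focal Loss for Dense Object Detection”. They come up with a neat loss function that down-weights easy, well-classified examples, which helps with class imbalance (for example, the huge number of easy background boxes in a dense detector).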
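For reference, the focal loss from that paper is FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true label; the (1 - p_t)^γ factor shrinks the loss of easy, confidently correct examples. A minimal per-prediction sketch in plain Python follows, using the paper's reported defaults γ = 2 and α = 0.25; the function names and the `eps` clipping are my own assumptions, and in practice you would use a vectorised version in Keras or PyTorch.

```python
import math

def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25,
               eps: float = 1e-7) -> float:
    """Binary focal loss for one prediction p = P(object of this class)
    and ground-truth label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true label
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p: float, y: int, eps: float = 1e-7) -> float:
    """Plain binary cross-entropy, for comparison."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p if y == 1 else 1.0 - p)

# Easy background box (p = 0.01, y = 0): loss shrinks by about 7.5e-05.
print(focal_loss(0.01, 0) / cross_entropy(0.01, 0))
# Hard, misclassified object (p = 0.1, y = 1): loss shrinks by only about 0.2.
print(focal_loss(0.1, 1) / cross_entropy(0.1, 1))
```

With γ = 0 and α = 0.5 the expression reduces to half of ordinary cross-entropy, which is a handy sanity check.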




Thanks for the pointers so far; at least I now have an idea of what to search for and research (with so many terms and abbreviations, I wasn’t even sure how to phrase what I’m aiming to do!).

I’m hoping to build something relatively simple from the ground up (using Keras or PyTorch), based on everything I’ve learned so far in the course. It might seem like reinventing the wheel, but I’d like to think it’ll give me a deeper understanding of the nuts and bolts, so that one day I can read these papers and think “aaahhhh!” instead of “huh?!”.

Thanks once again!