Classifying Multiple Objects in a Single Image

Hi everyone,

Absolutely loved the course. I’m working through it again more slowly, and in lesson 3 (around 54:30) someone asks if we’ll cover multiple object classification. Jeremy says we’ll likely get to it in Part 2.

A lot of Part 2 went over my head, but did we get to this: how to classify multiple objects in an image? As in "there are 3 cats in this photo" type of thing? I suppose the material in the final lesson on the Tiramisu architecture for segmentation would do this if you counted distinct segment blobs, but there would be obvious problems with that approach: disjoint blobs for a single object (e.g. a person behind a pole) or a single blob for multiple objects (e.g. cats together on a couch).

Any pointers for multiple-object classification would be appreciated :slight_smile:



Good question. The Kaggle Sea Lions challenge is one such task: identifying various kinds of objects in an image and then counting each kind, or feeding them into some downstream task. Any general ideas on how to tackle these kinds of problems?

The only idea I’ve come up with is segmentation + counting blobs.
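For anyone who wants to try the blob-counting idea, here is a minimal sketch of the counting step only. It assumes you already have a binary mask for one class from some segmentation model; the `count_blobs` helper and the toy mask are made up for illustration, and it uses plain 4-connectivity flood fill rather than anything from the course.

```python
from collections import deque


def count_blobs(mask):
    """Count 4-connected regions of truthy cells in a 2D list (hypothetical helper)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Found an unvisited foreground pixel: start a new blob and flood fill it.
            blobs += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    in_bounds = 0 <= ny < rows and 0 <= nx < cols
                    if in_bounds and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return blobs


# Toy mask for a single class: three separate regions of foreground pixels.
toy_mask = [
    [1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 0],
]
print(count_blobs(toy_mask))
```

Note that this inherits the problems mentioned earlier in the thread: two touching objects come out as one blob, and an object split by an occluder comes out as two.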

Or maybe using the heatmap approach from lesson 7.

Keen to hear if anyone has ideas for a more robust solution :slight_smile:

Object detection models (Faster R-CNN, SSD and YOLO) might work.
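Once a detector gives you boxes, the counting part could look roughly like the sketch below. It assumes the detector output has been converted into `(label, score, (x1, y1, x2, y2))` tuples, which is a made-up format for illustration and not the output format of any particular library. The thresholds are arbitrary, and the duplicate removal is a simple greedy per-class non-maximum suppression (many detectors already do this step internally).

```python
from collections import Counter


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def count_objects(detections, score_thresh=0.5, iou_thresh=0.5):
    """Count objects per class after dropping weak and near-duplicate boxes."""
    kept = []
    # Visit the most confident detections first (greedy NMS).
    for label, score, box in sorted(detections, key=lambda d: d[1], reverse=True):
        if score < score_thresh:
            continue
        # Skip a box that heavily overlaps an already kept box of the same class.
        if any(k_label == label and iou(box, k_box) > iou_thresh
               for k_label, _, k_box in kept):
            continue
        kept.append((label, score, box))
    return Counter(label for label, _, _ in kept)


# Hypothetical detector output for one image.
example = [
    ("cat", 0.95, (10, 10, 50, 50)),
    ("cat", 0.90, (12, 11, 52, 49)),      # near-duplicate of the first box
    ("cat", 0.80, (100, 20, 140, 60)),
    ("dog", 0.70, (60, 60, 90, 90)),
    ("cat", 0.30, (200, 200, 220, 220)),  # below the score threshold
]
counts = count_objects(example)
print(counts["cat"], counts["dog"])
```

Unlike blob counting, overlapping boxes of the same class can both survive as long as their IoU stays under the threshold, so cats sitting next to each other on a couch have a chance of being counted separately.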

Here is the blog post on the 8th place solution from the fisheries competition:


Found this thread really helpful for SSD: About bounding box localization

I’m trying it out today so I’ll let you know.

I never got Faster R-CNN working on a custom data set… but maybe someone more experienced than me can give it a shot. I tried both PyTorch and Caffe backends. Here is a TensorFlow version I haven’t tried yet.

DenseCap may be worth a look as well. It goes a little beyond what you need but I’ve heard the code is high quality and I know of at least one group that got it to work. The same group struggled with YOLO.

Isn’t that what the attention models were about? You’d couple a CNN trained on single objects with an RNN, and the RNN decides where the CNN has to look for matches. I heard that’s also how Google approached the challenge of spotting house numbers in Google Street View pictures.

Awesome, thanks everyone for the ideas and suggestions - will start looking at YOLO and report back :slight_smile: