How to Approach This Type of Vision Problem?

Hi there.

I’m wondering if anyone has some advice for how to approach problems where we have multiple images per label? Concretely, I have multiple images for a particular object at different depths and perspectives and the the object belongs to a particular class which needs to be classified.

The naive thing to do is to replicate the label for each image and treat them independently in the model, but I think it would be better to have the model learn the correct label for all the images belonging to the same object collectively. But, I don’t know the best way to do that. Any suggestions?

One thought I have is to just concatenate all the images belonging to the same object along either the row or column dimension. So, if I have 5 images belonging to the same object and they’re all 3x250x250, the resulting concatenated tensor would be 3x1250x250 if I concatenated across the row dimension or 3x250x1250 across the column dimension. The issues with this technique are that my tensors will be really large and so the batch-size will need to really low and, more problematically, I have a different number of images for each object. One object might have 5 images while a second object might have 18 images, while a third might have 2 images. Of course, I could pick some baseline number, n, that defines how many random images will be selected and concatenated from each object in creating a batch of tensors.

Also, maybe my google skills are subpar, but I couldn’t easily find literature on this topic in the academic canon or otherwise. So, if you are aware of any papers that deal with this topic, I would be greatly obliged if you passed them along.


No thoughts on this?

I’m sure someone else has tackled this problem before that’s read this, so maybe I didn’t explain myself.

As an example, say you’re trying to classify images of properties as one of three things: “a single-family residence”, “a condo”, or “a duplex”. For each individual property in your dataset you have multiple images of that property from multiple different angles. The goal is to build a model that can assess multiple images from a single property and correctly classify the property. The question is, how do you structure your data and model to take advantage of the fact that you have multiple images of the same thing and they all should get the same classification?

The naive approach of repeating the labels how does it work?

I’m sure it would do OK, but I can guarantee it would do worse than if you we let the model know that “hey, all these photos should get the same label because they’re all the same thing”. You might say well if the naive approach is good enough then don’t bother with something more complex, and I get that, but imagine this were a Kaggle competition…I’m pretty sure the naive approach would not get you top 10%.

I imagine the problem is that you don’t have multiple pictures per location for all locations, so the network should have a variable input size.
You could train multiple networks and ensemble their results. I have done this for multiple images for one output (different google maps images to produce one output). But the best approach appears to be to train the models separately and then make a regression on the feature map of each to ensemble them together.
You need different models if the images are different takes from the same object, because convolutions are tied to the spatial window they are looking. Just to put it another way, stacking your various images in a multi-channel meta image, and feeding that to a Resnet would not work.
There are some papers about this in the net.