Object detection models which use multiple pictures of the same object from different angles

Hi, I would like to detect defects in metal parts. I have multiple pictures of the same object taken from different angles. Some defects are easier to detect if you look at two (or three) of these pictures from different angles rather than at just one. However, the object detection models I know of (such as RetinaNet or Faster R-CNN) process one image at a time.

Do you know of any model(s) which can look at multiple pictures of the same object at once?

You can use batch processing.

But I don’t get what you mean by looking at multiple pictures. If the images are taken at different angles, then they are essentially different images (unless you have some angle information on how to combine them).

Hi Kushaj,

I’m not talking about mini-batches. I’m saying that I want to build an object detection model which takes two (or three, but let’s say two to fix ideas) images as an input, and produces an output (a vector of bounding boxes and classes). Say, a front picture and a back picture of the same object. I think I can get some angle information, as long as it doesn’t have to be incredibly accurate.

Let’s do a thought experiment: in dog vs. cat classification, suppose we concatenate each image with a mirrored copy of the cat/dog. Will it make any difference? I would argue no.


Your thought experiment is unrelated to my use case. I have two different pictures of the same object, not the same picture subject to data augmentation. Think orthographic projections:


While I don’t quite know of models which look at multiple pictures of the same object at once, you could try a couple of approaches to see if they are useful:

  1. Concatenate the two (or more) images together into a single image while training your model, and then see if it gives you useful results
  2. Create different models for each angle of the object (i.e., different models for the top view, front view, right view etc), and then use an ensemble of different models to classify defects
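A minimal sketch of the first idea, using NumPy to stack two views side by side before feeding the result to the detector (this assumes both views have the same height and channel count; in practice you may need to resize or pad first):

```python
import numpy as np

def concat_views(img_a: np.ndarray, img_b: np.ndarray) -> np.ndarray:
    """Stack two H x W x C views of the same object side by side.

    Assumes both arrays share the same height and channel count;
    resize or pad beforehand if they do not.
    """
    if img_a.shape[0] != img_b.shape[0] or img_a.shape[2] != img_b.shape[2]:
        raise ValueError("views must share the same height and channel count")
    # Concatenate along the width axis to form one wider training image.
    return np.concatenate([img_a, img_b], axis=1)
```

Remember that bounding-box annotations for the second view must be shifted right by the width of the first view when you build labels for the concatenated image.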

Very interesting suggestions, thanks a bunch! The first idea is self-explanatory; regarding the second idea, any suggestions on how to ensemble in practice?

I’m doing object detection. Suppose I have model A, receiving image X_A as input, and model B, receiving image X_B as input. Now, model A will generate an output consisting of zero or more bounding boxes, with zero or more classification scores (or probabilities). Model B will also output zero or more bounding boxes, with associated class probabilities. How could I combine them? I guess I should first of all transform the coordinates of the bounding boxes to a common reference frame. In doing this, @kushaj’s suggestion about using the angle(s) information would be key. Interesting!

So you have two images of the same dog. Consider one image to be the reference one, and the other image is rotated by some angle theta. Now you want to detect the bounding boxes for the dog in both images and then combine the results.

To achieve this you can first do object detection on the two images independently (using a mini-batch if you want) and then combine the results. To combine the results you just need to rotate the second image’s bounding boxes by -theta.

Code to build the rotation matrix (in homogeneous coordinates) is

import math

def _rotate(degrees):
    # Convert degrees to radians, then build a 3x3 rotation matrix.
    angle = degrees * math.pi / 180
    return [[math.cos(angle), -math.sin(angle), 0.],
            [math.sin(angle),  math.cos(angle), 0.],
            [0.,               0.,              1.]]
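To map the rotated view’s predictions back, you can apply such a rotation to each box corner and take the axis-aligned extent of the result. A sketch (assumes boxes are `(x1, y1, x2, y2)` tuples and rotation about the origin; pass `-theta` to undo the second view’s rotation):

```python
import math

def rotate_box(box, degrees):
    """Rotate an axis-aligned box (x1, y1, x2, y2) about the origin
    and return the axis-aligned box enclosing the rotated corners."""
    angle = degrees * math.pi / 180
    c, s = math.cos(angle), math.sin(angle)
    x1, y1, x2, y2 = box
    # Rotate all four corners, not just two: after rotation the box
    # is no longer axis-aligned, so we take the enclosing extent.
    corners = [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]
    rotated = [(c * x - s * y, s * x + c * y) for x, y in corners]
    xs = [p[0] for p in rotated]
    ys = [p[1] for p in rotated]
    return (min(xs), min(ys), max(xs), max(ys))
```

Note that the enclosing axis-aligned box is an approximation: for rotations that are not multiples of 90 degrees it is larger than the true rotated box.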

Now to combine the predictions you can use Non-Maximum Suppression.
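A minimal class-agnostic NMS sketch to merge the two models’ predictions once they are in a common frame (boxes as `(x1, y1, x2, y2)`; in a real pipeline you would likely use a library implementation such as `torchvision.ops.nms`):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    # Visit boxes in descending score order; keep a box only if it
    # does not overlap too much with any already-kept box.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Pool the boxes and scores from both models into single lists before calling `nms`, so duplicate detections of the same defect collapse into one.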


Nice! I’ll try this. Thanks!

That’s why it’s a thought experiment: it’s not the mirroring of the same image that matters; I was hoping you could see the reason behind it.
Let’s go further: suppose we concatenate every 100 cat images into one image, and do the same for the dogs. I’d argue we would still get good results.
The reason is that a CNN starts locally and builds a feature hierarchy step by step; absolute location does not matter, while relative location (or the composed features) makes the final difference.

If you concatenate the images, you increase the input size, and I have seen in various object detection experiments that the model’s accuracy decreases for larger images.


I disagree. We are talking about the object detection task here, not image classification. The absolute location does matter. If I predict that there are two dogs in an image, but I predict them to be in the wrong location, the intersection over union will be low or zero and the mAP will also be low.

I have the same issue. Did you find the model you were looking for, and can we use it?