ObjectDetectionInterpretation implementation

(Pierre Ouannes) #1

Hi !

I’m creating this topic to discuss the ObjectDetectionInterpretation class I’m developing to inspect Object Detection results, with an API similar to that of ClassificationInterpretation.
Any suggestions are most welcome!

I have a question for you @sgugger, if you don’t mind. One issue I’m running into is in the loss_batch function (in basic_train), on this line:

if not loss_func: return to_detach(out), yb[0].detach()

Only the first element of yb is returned (I don’t really know why). In Object Detection, yb[0] holds the bounding box targets and yb[1] holds the class targets, and both are needed for get_preds. So what would you suggest?
Can I modify loss_batch directly so that it also returns yb[1] (in which case validate, at least, would also need to be modified)? I’m not sure how that would impact other applications, which is why I’m asking. The alternative is to develop a get_preds method specific to Object Detection.

Could you advise me on the best course of action?

Thanks !



Ah, that’s tricky. It returns yb[0] because yb has been listified, and we need the actual element in that case. I’m not too sure how it’s going to work with a listy target.
In your case you can try to_detach(yb). In general I think it should be something like that, with a potential squeeze to remove useless dims, but it would need more testing.
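
To make the idea concrete, here is a rough sketch of "detach everything, then unwrap only when there is a single target". It is not the actual fastai code: detach_all and unwrap_single are hypothetical helper names, and FakeTensor is only a stand-in for torch.Tensor so the snippet runs without PyTorch installed.

```python
def detach_all(x):
    """Recursively detach every tensor-like object inside nested lists/tuples."""
    if isinstance(x, (list, tuple)):
        return type(x)(detach_all(item) for item in x)
    # Duck typing: anything exposing .detach() (e.g. torch.Tensor) gets detached.
    return x.detach() if hasattr(x, "detach") else x


def unwrap_single(yb):
    """Return the lone element of a one-item list, otherwise the whole list."""
    return yb[0] if len(yb) == 1 else yb


class FakeTensor:
    """Stand-in for torch.Tensor (assumption: only .detach() is needed here)."""

    def __init__(self, name, requires_grad=True):
        self.name = name
        self.requires_grad = requires_grad

    def detach(self):
        # Like torch, return a new object that no longer tracks gradients.
        return FakeTensor(self.name, requires_grad=False)


# Object detection case: yb holds [bbox_targets, class_targets], keep both.
yb_detection = [FakeTensor("bboxes"), FakeTensor("classes")]
targets = unwrap_single(detach_all(yb_detection))

# Classification case: yb was listified from a single target, so unwrap it.
yb_classification = [FakeTensor("labels")]
single_target = unwrap_single(detach_all(yb_classification))
```

With this shape of helper, single-target applications would still get the bare element back (as yb[0] does today), while a listy target would come back whole. Whether that is safe for every caller of loss_batch would still need testing.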


(Pierre Ouannes) #3

Oh right, I forgot it was listified.
I’ll try to make something work then, thanks for your input!


(Pierre Ouannes) #4

If I may, I have another question about the design of the Object Detection API you want for fastai (I’ll do my best not to bother you with too many questions, but I feel this one is important as it concerns fastai’s design).

As you know, I’m working off your implementation of RetinaNet. In it, the forward pass of the model outputs three things in a list:

  1. The classification predictions, of size batch size x number of anchors x number of classes
  2. The BBox predictions, of size batch size x number of anchors x 4
  3. (if I understood correctly) The sizes of the feature maps that the FPN outputs.

Can I assume that the first two are what an Object Detection model will always output?

A similar question for the labels: in the RetinaNet notebook, the class labels for each image are not one-hot encoded, whereas the model output has one score per class. Can I assume that will always be the case?
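
To illustrate the mismatch I mean, here is a toy sketch in plain Python (nested lists instead of tensors, made-up numbers, and hypothetical helper names argmax and one_hot). It ignores anchor matching and any background class, so it only shows the shapes and the two label encodings, not how the RetinaNet notebook actually handles them.

```python
def argmax(scores):
    """Index of the largest value in a flat list of scores."""
    return max(range(len(scores)), key=lambda i: scores[i])


def one_hot(label, n_classes):
    """Encode an integer class label as a one-hot list of length n_classes."""
    return [1 if i == label else 0 for i in range(n_classes)]


# Toy batch: 1 image, 3 anchors, 4 classes (values are invented).
class_preds = [[[0.1, 0.7, 0.1, 0.1],
                [0.6, 0.2, 0.1, 0.1],
                [0.2, 0.2, 0.1, 0.5]]]   # batch size x anchors x classes
bbox_preds = [[[0.0, 0.0, 1.0, 1.0],
               [0.1, 0.1, 0.5, 0.5],
               [0.2, 0.3, 0.9, 0.8]]]    # batch size x anchors x 4

# Both outputs should agree on batch size and number of anchors.
assert len(class_preds) == len(bbox_preds)
assert all(len(c) == len(b) for c, b in zip(class_preds, bbox_preds))

# Per-anchor class index, comparable with integer (non one-hot) labels.
pred_idx = [[argmax(anchor) for anchor in image] for image in class_preds]

# Going the other way: integer labels turned into one-hot rows.
int_labels = [1, 0, 3]
one_hot_labels = [one_hot(label, 4) for label in int_labels]
```

So an interpretation class would need either an argmax over the class dimension or a one-hot conversion of the labels before the two can be compared directly.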



Since I haven’t had the time to train and finish this part, I can’t answer those questions yet. So for now do your best, and we’ll adapt whatever you come up with if the API changes.
