Mean average precision (mAP)



Found another write-up explaining Precision & Recall, Average Precision, and mAP specifically for object detection. This is probably the best one I’ve come across so far:

Comes with code!


Thanks for sharing!
I’ve finished a very inefficient way of doing this. Not sure if my code is good or not but I’m finding 0.31 for the model trained like in Jeremy’s notebook (with bias at -4, trained for two cycles and with NMS at the end).
It seems a bit low to me though, so I may have made a mistake.


Perhaps you could share a gist? There’s a few folks working on this so I’m sure they’d appreciate some code to look at!

Was just cleaning up my notebook, it’s here. I have included my model save if someone wants to try to replicate the results. The beginning is just the lesson notebook, it begins at ‘False positives and negatives’.

Speaking of which, while drawing some plots, I realized I’m still unsure on how to count them. Specifically I have this prediction/ground truth:

I was thinking it would count as, Dog: 1TP, 1FP and Sofa: 1TP. The thing that is annoying me is that the green and purple box are a hit to both ground truth objects with an IoU > 0.5, so if you use the definition some people give (one hit with the wrong class), they should each count once as FP…


Thanks for the notebook, will look through it.

Re: your question, I don’t quite understand what you mean by:

The larger dog prediction box (teal color, confidence of 0.38) should only count as 1 FP in the dog category and not be counted at all in the sofa category. This is what I’m understanding your first statement (“I was thinking it would count as, Dog: 1TP, 1FP and Sofa: 1TP”) to mean and I agree with that.

Maybe you have an error in keeping the classes separate when doing your TP/FP counts? Or do you mean that some variant of the mAP metric is supposed to count detections across different classes as FPs?

No, no, I keep the TP/FP counts separate by class. It’s just that I’ve read descriptions of the algorithm that count them by looking at the hits on each ground truth object, but I think they might sort the predictions by decreasing overlap, so the big dog and big sofa boxes are matched with the ground truth sofa (bigger IoU), but not the ground truth dog.

So it’s a bit clearer with a (short) night of sleep. In the end I let myself be confused by the matching process (how to attribute a prediction to a ground truth object) which is kind of the same as in the loss function.

I’ve updated the notebook to fix some bugs, make it a bit more efficient, and add a lot of visualizations of TP/FP/FN in the hope it will help people who are as confused as I was. I’d be curious to see if my final mAP is correct or not, so any feedback is more than welcome!


Hey, thanks again for sharing and updating your notebook! Great stuff and useful learning for me, especially on making things readable and more computationally efficient.

I’ve been (very slowly) coding my way to a fuller understanding of TP, FP, and FN scoring, adjusting the model and IoU thresholds, all the way to mAP (correctly I hope).

For reproducibility and consistency across our work, I copied the first half of your notebook and loaded your trained weights (thanks for that!) to start off at the same point. I hope by the end, I’ll be able to tell you that our mAPs match up (or at least why they don’t).

For now, I wrote some new visualization/educational functions to count every TP, FP, FN, and ground truth object for every class category in an image so I could play around with the model sensitivity and IoU thresholds. Along with some commentary, it looks like this:

Here is my adapted nb:

Hope it’s helpful. I’m moving steadily but slowly on implementation so will continue to comment as I go!


Thanks for your comments! We already disagree on one situation so I’m not sure we’ll find the same results. In this image:

I’m counting
sofa : 1TP
dog: 1FP, 1FN
because the predicted box of the dog is matched to the ground truth object sofa in my code since it’s the one it overlaps the most, but it might be a mistake.

Aaaaah! Got it!
I forgot we have to detect TP,FP and FN for each class separately! So you’re right and I’m wrong on this one. Will try to correct my functions tonight or tomorrow.


Thanks for the follow-up! I was scratching my head too and checking if I misunderstood something somewhere (among the incredibly confusing amount of materials about mAP…)

To lay out my current understanding using the dog and sofa example in the image:

For class ‘dog’:

  1. calc jaccard(all ‘dog’ predictions, all ‘dog’ gt_bboxes)
  2. for all ‘dog’ predictions where IoU > 0.5 (or a different iou_thres), take the highest IoU box and call that TP (more generally: for n ‘dog’ ground truth objects, count n-highest IoU ‘dog’ predictions as TPs)
  3. count all other ‘dog’ predictions as FP. If all predictions for the class are <= IoU threshold, then they’re all counted as FPs? This last part I’m not too sure on yet…

Repeat for class ‘sofa’ and any other class of interest.

For mAP more generally, we loop through all classes one at a time. Even if there are no ground truth objects in a class, we still want to catch all false positives, since a prediction could be made for a class outside of ground truth (e.g. a prediction for ‘car’ in our example image would be a FP in the ‘car’ class). We also want to catch all false negatives in every class (e.g. if a ‘sofa’ gt object has no predicted boxes, that would be a FN in the ‘sofa’ class).

Does that sound right to you?
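The per-class counting described above can be sketched for a single image as follows. This is only a rough sketch, not the code from either notebook: the names `iou` and `count_for_class` are hypothetical, boxes are assumed to be `[x1, y1, x2, y2]`, and the matching follows the greedy scheme of the PASCAL VOC development kit (predictions visited by decreasing confidence, each ground truth object matched at most once) rather than picking the highest-IoU box first as in step 2.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def count_for_class(pred_boxes, pred_scores, gt_boxes, iou_thres=0.5):
    """Return (TP, FP, FN) for ONE class in ONE image (hypothetical helper)."""
    pred_boxes = np.asarray(pred_boxes, dtype=float).reshape(-1, 4)
    gt_boxes = np.asarray(gt_boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(pred_scores, dtype=float)
    matched = np.zeros(len(gt_boxes), dtype=bool)
    tp = fp = 0
    # Assumption: visit predictions from most to least confident.
    for i in np.argsort(-scores):
        if len(gt_boxes) == 0:
            fp += 1  # a prediction for a class with no gt object is always a FP
            continue
        overlaps = iou(pred_boxes[i], gt_boxes)
        j = int(overlaps.argmax())
        if overlaps[j] > iou_thres and not matched[j]:
            matched[j] = True
            tp += 1
        else:
            fp += 1  # too little overlap, or the gt object was already claimed
    fn = int((~matched).sum())  # gt objects nobody matched
    return tp, fp, fn

# Made-up boxes loosely mimicking the dog/sofa example: one tight dog box,
# one large dog box, and a single dog ground truth object.
dog_counts = count_for_class(
    [[12, 12, 50, 50], [0, 0, 100, 60]], [0.9, 0.38], [[10, 10, 50, 50]]
)
print(dog_counts)
```

Calling the function once per class and summing the counts over all images gives the totals needed for the P-R curves. A class with predictions but no gt objects yields only FPs, and a class with gt objects but no predictions yields only FNs.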

Btw, I’ve created a P-R curve for category 14 (‘person’) using a range of model thresholds = np.linspace(.05, 0.95, 40, endpoint=True) which I believe is the same as your nb.

Shape looks pretty similar despite the difference of TP/FP/FN counting. I guess because across the whole image set, it’s not a significant difference in the P or R calcs:


The AP of cat 14 in this P-R curve (using your avg_prec func) comes out to be 0.286.


Yes, it’s all clearer to me now, thanks for re-explaining it because it sure needs the details.
I’ve finished rewriting the functions to (hopefully) have the same results as you in terms of TP/FP/FN and updated it on github. In the end, I get to 0.2863 for the AP of class 14 and it changes the whole mAP from 30.7% (bugged version) to 30.99% (new version) so it was indeed small.

Hope this last version is the good one!


Awesome! I’ve also finished running the full 40-step range of model thresholds on all 20 categories. Ended up with a mAP of 30.17%. I guess the full description of the metric result should be “mAP@0.5 of 20 classes using 11-point interpolated average precision” to be exact.
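For anyone following along, here is a minimal sketch of what “11-point interpolated average precision” means, assuming the PASCAL VOC 2007 definition. The function name `eleven_point_ap` is hypothetical (it is not the `avg_prec` function from the notebook), and the small tolerance on the recall comparison is my own addition to sidestep floating-point noise at the recall levels.

```python
import numpy as np

def eleven_point_ap(precisions, recalls):
    """11-point interpolated AP from paired precision/recall measurements."""
    precisions = np.asarray(precisions, dtype=float)
    recalls = np.asarray(recalls, dtype=float)
    total = 0.0
    # Recall levels 0.0, 0.1, ..., 1.0
    for r in np.linspace(0.0, 1.0, 11):
        # Interpolated precision: best precision at any recall >= r.
        mask = recalls >= r - 1e-9
        total += precisions[mask].max() if mask.any() else 0.0
    return total / 11

# Toy curve with two operating points (made-up numbers).
ap = eleven_point_ap(precisions=[1.0, 0.5], recalls=[0.5, 1.0])
print(round(ap, 4))
```

In this toy case the six recall levels up to 0.5 see a best precision of 1.0 and the remaining five see 0.5, so the AP is (6 × 1.0 + 5 × 0.5) / 11. mAP is then just the mean of the per-class AP values.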

I’ve updated my gist as well. The code is painfully inefficient, needs refactoring, and I want to double check to see why we have a 0.8% diff in our mAPs.

Here are my P-R curves per class (with APs on top). If you could generate something similar, comparing them will probably be the quickest way to spot any major class-specific discrepancies between our methods:

Update: I had a list of mAP benchmarks here but I don’t think they’re up to date (mAP ranged from 22.7-43.7 and newest paper was from 2013). Here’s the performance table from the SSD paper:

It’s not immediately clear to me what the exact mAP performance criteria they use is…my homework for tomorrow morning :slight_smile:


Great discussion in this thread!

Maybe comparing results to the MultiBox paper might be a better idea.

Not sure how to best verify the correctness of the implementations of mAP :thinking: Maybe a pretrained model for one of the archs exists and we could run it to output predictions and calculate the mAP?

Other than that probably implementing quite faithfully one of the simpler models and comparing the results might be the best proxy.

Will most likely get around to doing such a thing myself but given the pace at which I am going someone is very likely to beat me to it :slight_smile:

Anyhow, having a working implementation of mAP that we can trust would be quite cool!


Here are the figs with my version of the code

The difference between our scores (and figs) is the way we treat the zeros in TP+FP or TP+FN. I chose to set the value to one when we have 0/0 to get nicer curves; I’m not sure what the right way to handle it is.

If I take the same function as you do for the precision and recall (by adding 1e-15 to the denominator) I get 30.17% as well.
I don’t think they used this definition of mAP in the SSD paper. Ours is the one that was used in the 2007 competition and I think the formula changed in the evaluation for 2012. And I didn’t check but it might also be different for the COCO dataset/competitions.


Mystery solved, and quickly too! Thanks for checking and making the figs.

I added an epsilon of 1e-15 to avoid ZeroDivisionError. Looking at the public implementations of mAP, one of them sets precision=0 if there’s a division error and two of them (including the official one) use the smallest possible epsilon (np.spacing(1) = 2.220446049250313e-16) either as an addition to TP+FP or as max(TP+FP, eps).

I’ll change my treatment of division errors to that last approach (max(TP+FP, eps)) which seems most appropriate and remove eps from the recall calculation where it’s unnecessary (assuming there’s always at least 1 gt object present of every class across the full test set).
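Roughly, the treatment I have in mind looks like the sketch below. The function name `precision_recall` is just for illustration, and the recall line carries the assumption stated above (at least one gt object per class, so TP+FN is never zero).

```python
import numpy as np

EPS = np.spacing(1)  # 2.220446049250313e-16

def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts (hypothetical helper)."""
    # max(TP+FP, eps) only changes the result in the 0/0 case,
    # where precision then comes out as 0.
    precision = tp / max(tp + fp, EPS)
    # Assumption: tp + fn > 0, i.e. the class has at least one gt object.
    recall = tp / (tp + fn)
    return precision, recall

# No predictions survive the threshold: precision is 0, not 1.
print(precision_recall(tp=0, fp=0, fn=3))
# Counts taken from one of the 'person' examples in this thread.
print(precision_recall(tp=2, fp=13, fn=3))
```

Note that this makes the 0/0 case evaluate to a precision of 0, whereas setting it to 1 gives the nicer-looking curves mentioned earlier, so the choice does shift the AP values slightly.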

This change won’t do much for our mAP though. I’ll do some manual checking, particularly within the poorer performing classes like ‘sheep’ and ‘boat’, and see if there are any steps amiss between individual image TP,FP,FN scoring and P-R curve construction.

I also came across this paper, “Diagnosing Error in Object Detectors”, which is frequently cited. It does confirm that we are counting our TPs and FPs correctly with regards to the IoU threshold >0.5.

Informative further break-out of the types of FP errors:

One major type of error is false positives, detections that do not correspond to the target
category. There are different types of false positives which likely require different
kinds of solutions. Localization error occurs when an object from the target category
is detected with a misaligned bounding box (0.1 <= overlap < 0.5). Other overlap
thresholds (e.g., 0.2 <= overlap < 0.5) led to similar conclusions. “Overlap” is defined
as the intersection divided by union of the ground truth and detection bounding
boxes. We also consider a duplicate detection (two detections for one object) to be localization
error because such mistakes are avoidable with good localization. Remaining
false positives that have at least 0.1 overlap with an object from a similar category are
counted as confusion with similar objects. For example a “dog” detector may assign
a high score to a “cat” region. We consider two categories to be semantically similar
if they are both within one of these sets: {all vehicles}, {all animals including person},
{chair, diningtable, sofa}, {aeroplane, bird}. Confusion with dissimilar objects
describes remaining false positives that have at least 0.1 overlap with another labeled
VOC object. For example, the FGMR bottle detector very frequently detects people
because the exterior contours are similar. All other false positives are categorized as
confusion with background. These could be detections within highly textured areas
or confusions with unlabeled objects that are not within the VOC categories.
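The taxonomy in the excerpt above reads like a simple decision cascade, which can be sketched as follows. This is my own rough reading of the text, not the paper’s code: the function `fp_type` and its input format (a list of `(gt_class, overlap)` pairs for every gt object in the image) are hypothetical, and the vehicle and animal sets are filled in from the 20 VOC class names.

```python
# Similarity sets from the excerpt, expanded with the VOC class names.
SIMILAR_SETS = [
    {"aeroplane", "bicycle", "boat", "bus", "car", "motorbike", "train"},
    {"bird", "cat", "cow", "dog", "horse", "sheep", "person"},
    {"chair", "diningtable", "sofa"},
    {"aeroplane", "bird"},
]

def are_similar(a, b):
    """True if both classes appear together in one of the similarity sets."""
    return any(a in s and b in s for s in SIMILAR_SETS)

def fp_type(pred_class, gt_overlaps, min_overlap=0.1):
    """Categorize a detection that has ALREADY been scored as a false positive.

    gt_overlaps: list of (gt_class, iou) pairs for all gt objects in the image.
    """
    def best(pairs):
        return max((o for _, o in pairs), default=0.0)

    same = [(c, o) for c, o in gt_overlaps if c == pred_class]
    # Covers both misaligned boxes (0.1 <= overlap < 0.5) and duplicates
    # (overlap >= 0.5 on an object that another detection already claimed).
    if best(same) >= min_overlap:
        return "localization"
    others = [(c, o) for c, o in gt_overlaps if c != pred_class]
    similar = [(c, o) for c, o in others if are_similar(pred_class, c)]
    if best(similar) >= min_overlap:
        return "similar"
    if best(others) >= min_overlap:
        return "dissimilar"
    return "background"

print(fp_type("dog", [("dog", 0.27), ("sofa", 0.9)]))  # misaligned dog box
print(fp_type("bottle", [("person", 0.3)]))            # bottle fired on a person
```

Tallying these categories per class would show whether a low AP comes mostly from poor localization or from confusion with other classes or background.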


I’ve taken a tour through the lowest scoring AP classes (all < 0.100) : ‘sheep’, ‘bottle’, ‘boat’

Doing so leads me to believe our mAP calculations are essentially correct (with minor quirks like handling division by zero differently) and a roughly accurate reflection of our model performance. Taking a look at the worst performing classes also helped me ID a few areas of improvement where tweaking our model architecture and training parameters should help increase mAP score.

Our model struggles with very small, numerous gt objects like sheep. For example, let’s look at the TP, FP, and FN count of ‘sheep’ at different model threshold levels (the ‘model_thres’ value defined in NMS: c_mask = conf_scores[cl] > model_thres) with visualized prediction bboxes versus gt bboxes:

  2 FP 'bird'
  1 FP 'cow'
  0 TP, 19 FP, 3 FN, 3 actual 'sheep'

  0 TP, 5 FP, 3 FN, 3 actual 'sheep'

  3 FN 'sheep'

Across threshold levels, we generate a different number of False Positives as expected but we never get any TP hits on ‘sheep’ because the prediction box IoUs are always < 0.5. At the lower end, we have many FPs instead which reduce our Precision score and at the higher end, we stop making predictions and end up with all FNs.

Here are some other examples of this where the gt objects are very small:

  4 FP 'bird'
  1 FP 'cat'
  0 TP, 1 FP, 1 FN, 1 actual 'sheep'

  2 FP 'bird'
  2 FN 'sheep'

We have a similar problem with ‘bottle’ where they’re usually very small and/or numerous:

  0 TP, 1 FP, 2 FN, 2 actual 'bottle'
  2 TP, 13 FP, 3 FN, 5 actual 'person'

  1 FN 'bottle'
  2 FP 'chair'
  1 TP, 2 FP, 3 FN, 4 actual 'person'
  1 FN 'pottedplant'
  0 TP, 1 FP, 1 FN, 1 actual 'sofa'

The model also struggles with narrow (very tall or wide) rectangles and/or scales that range from one extreme to another. Basically objects that aren’t as square-ish or of consistent scale…i.e. tall bottles or tall boats that take up different proportions of the image frame:

  3 TP, 4 FP, 1 FN, 4 actual 'bottle'

  1 TP, 8 FP, 0 FN, 1 actual 'boat'

  1 TP, 2 FP, 1 FN, 2 actual 'boat'
  1 FN 'person'

In these situations, it looks like we usually get our TPs but also many FPs that fail to be suppressed.

Our classes are not balanced in training (or evaluation). Some classes show up less often than others so our model doesn’t have as many examples to train on. The diversity of how objects are presented within images (small flocks of sheep in the corner vs one large main subject sheep) also affects the generalizability of training. This alone doesn’t explain performance but it should be a contributing factor. I’ve highlighted the number of ‘sheep’, ‘boat’, ‘bottle’ objects in our dataset:


Here are some ideas to try which should help improve model mAP:

  1. Use more grid cells and more zooms to create smaller anchor boxes. Something like 7x7, 4x4, 2x2, 1x1 to capture small objects like sheep and bottles.
  2. Increase the aspect ratio parameters to have even more narrow rectangles to catch ships/bottles/people.
  3. Try stratifying train and val datasets (like what StratifiedKFold does in scikit-learn). And/or upsample underrepresented classes to get more class-balanced.
  4. Tweaking Non Max Suppression settings to adjust how many FPs remain.
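To make idea 4 concrete, here is a minimal greedy NMS sketch in NumPy showing the two knobs that control how many FPs survive. It is not the `nms` function from the lesson notebook: the names `box_iou` and `nms`, the default values, and the `[x1, y1, x2, y2]` box format are assumptions for illustration.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, model_thres=0.25, overlap_thres=0.5):
    """Greedy NMS for one class; returns indices of the boxes that are kept."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    keep = []
    # Most confident first; drop anything below the model threshold.
    for i in np.argsort(-scores):
        if scores[i] <= model_thres:
            continue
        # Keep the box only if it does not overlap a kept box too much.
        if all(box_iou(boxes[i], boxes[j]) <= overlap_thres for j in keep):
            keep.append(int(i))
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.3]
print(nms(boxes, scores))                     # near-duplicate box 1 is suppressed
print(nms(boxes, scores, model_thres=0.5))    # low-confidence box 2 is dropped too
print(nms(boxes, scores, overlap_thres=0.9))  # looser overlap keeps all three
```

Lowering `overlap_thres` suppresses more near-duplicate boxes (fewer FPs, but a risk of merging adjacent objects like a flock of sheep), while raising `model_thres` trades FPs for FNs.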

Wow, thanks for the amazing summary!
The model clearly needs more anchors, which is what they do in the original SSD paper. I’m trying to roughly replicate what they do (with a total of 8732 anchors!) and will share as soon as it’s ready. Maybe it will help with the narrow or the very small objects.
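For reference, the 8732 figure can be reproduced with a quick calculation, assuming the SSD300 configuration from the paper (six feature maps with 4 or 6 default boxes per cell):

```python
# SSD300 feature map sizes and default boxes per cell, as I read the paper.
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_cell = [4, 6, 6, 6, 4, 4]

per_layer = [s * s * k for s, k in zip(feature_map_sizes, boxes_per_cell)]
total_anchors = sum(per_layer)

print(per_layer)      # [5776, 2166, 600, 150, 36, 4]
print(total_anchors)  # 8732
```

Most of the anchors come from the 38x38 map, which is the one responsible for the smallest objects.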


This is great analysis. To add to the suggestions:

  • I’m not at all convinced the way I use tanh and turn the activations to anchor box changes is much good. It’s really just the first thing I came up with. Maybe check the yolo3 paper and see what they do, and try changing our code to do the same thing - does that help?
  • The papers don’t have different zoom levels IIRC - so they have lower k, but more conv grid outputs
  • Try making our custom head look more similar to the retinanet and/or yolo3 custom heads
  • Try using a feature pyramid network. This is what I plan for lesson 14, but I haven’t started on it yet

Overall, I’d suggest trying to gradually move our code closer to the retinanet and yolo3 papers, since they get great results so we know it should work!