Mean average precision (mAP)

daveluo · April 2, 2018, 4:09pm

You calculate precision separately for each class. So in your example:

given 5 predictions (3 for class one, 2 for class two) and a ground truth of 1 object in class one (and assuming none in class two)
for class one, you have 1 TP and 2 FP
for class two, you have 0 TP and 2 FP

The calculation of precision (TP/(TP+FP)) would give:

for class one: 1/3
for class two: 0/2
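A minimal sketch of that per-class arithmetic. The function name `precision` and the choice to return 0.0 when a class has no predictions at all are assumptions made for illustration, not something fixed by the metric:

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP).

    Returning 0.0 when there are no predictions is an assumed convention.
    """
    total = tp + fp
    return tp / total if total > 0 else 0.0


# Counts from the example above: class -> (TP, FP)
counts = {"class one": (1, 2), "class two": (0, 2)}

per_class_precision = {
    name: precision(tp, fp) for name, (tp, fp) in counts.items()
}
print(per_class_precision)
```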

The gap in my understanding that I haven’t fully filled yet is that average precision is different than precision.

While precision is defined as TP/(TP+FP), average precision is related to the (approximate) area under the precision-recall curve for each class. Like this (fig from here):

Found some good overviews of the difference here:

Best overview imo: https://medium.com/@timothycarlen/understanding-the-map-evaluation-metric-for-object-detection-a07fe6962cf3

https://sanchom.wordpress.com/tag/average-precision/

https://datascience.stackexchange.com/questions/25119/how-to-calculate-map-for-detection-task-for-the-pascal-voc-challenge

The mAP is then calculated as the mean of APs (quite a literal definition: the mean of the average precisions per class). You can also examine the AP for each class, like this (from here):
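As a rough sketch of "mean of the APs": the helper name `mean_average_precision` is hypothetical, and the AP values below are made-up placeholders purely to show the arithmetic, not results from any model:

```python
import numpy as np


def mean_average_precision(ap_per_class: dict) -> float:
    """mAP = unweighted mean of the per-class average precisions."""
    return float(np.mean(list(ap_per_class.values())))


# Made-up AP values, for illustration only.
ap_per_class = {"person": 0.30, "dog": 0.50, "sofa": 0.40}

map_value = mean_average_precision(ap_per_class)
print(f"mAP = {map_value:.3f}")
```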

And in object detection papers, you often see something like mAP@[.5:.95] which denotes an average of the mAPs across different IoU thresholds for considering if a prediction counts as a TP. More on this here: computer vision - What does the notation mAP@[.5:.95] mean? - Data Science Stack Exchange
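A small sketch of what the notation implies, assuming the COCO convention of IoU thresholds from 0.50 to 0.95 in steps of 0.05. The function `map_over_iou_thresholds` and the toy `toy_map` callable are hypothetical stand-ins; in practice the callable would run a full mAP evaluation at the given IoU threshold:

```python
import numpy as np

# 0.50, 0.55, ..., 0.95 (10 thresholds)
iou_thresholds = np.linspace(0.5, 0.95, 10)


def map_over_iou_thresholds(map_at_iou, thresholds=iou_thresholds) -> float:
    """Average the mAP obtained at each IoU threshold.

    `map_at_iou` is any callable taking an IoU threshold and returning a mAP.
    """
    return float(np.mean([map_at_iou(t) for t in thresholds]))


# Toy stand-in: pretend mAP drops linearly as the IoU threshold gets stricter.
def toy_map(iou_thres: float) -> float:
    return 1.0 - iou_thres


result = map_over_iou_thresholds(toy_map)
print(len(iou_thresholds), round(result, 3))
```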

What a confusing phrase to say out loud: “average of the mean average precisions”

I’ve been working on fully understanding and implementing mAP as well. It’s more complex and ambiguous than I expected. Perhaps we should start a new thread on this?

Btw, here are some github references I’ve been using to check my implementation:

https://gist.github.com/tarlen5/008809c3decf19313de216b9208f3734

https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py

https://github.com/amdegroot/ssd.pytorch/blob/master/eval.py

https://github.com/Cartucho/mAP/blob/master/main.py

sgugger · April 2, 2018, 4:17pm

You don’t say! I thought it would take me one hour this morning and I’m far from being done. Thanks for your explanation about true positives, it makes sense now.
I’m done with the part where it computes the precision/recall, just have to implement the different thresholds!

jeremy · April 2, 2018, 4:26pm

Done!

daveluo · April 2, 2018, 7:25pm

Found another write-up explaining Precision & Recall, Average Precision, and mAP specifically for object detection. This is probably the best one I’ve come across so far: https://medium.com/@timothycarlen/understanding-the-map-evaluation-metric-for-object-detection-a07fe6962cf3

Comes with code! https://gist.github.com/tarlen5/008809c3decf19313de216b9208f3734

sgugger · April 2, 2018, 10:07pm

Thanks for sharing!
I’ve finished a very inefficient way of doing this. Not sure if my code is good or not but I’m finding 0.31 for the model trained like in Jeremy’s notebook (with bias at -4, trained for two cycles and with NMS at the end).
It seems a bit low to me though, so I may have made a mistake.

jeremy · April 2, 2018, 10:20pm

Perhaps you could share a gist? There’s a few folks working on this so I’m sure they’d appreciate some code to look at!

sgugger · April 2, 2018, 10:51pm

Was just cleaning up my notebook, it’s here. I have included my model save if someone wants to try to replicate the results. The beginning is just the lesson notebook, it begins at ‘False positives and negatives’.

Speaking of which, while drawing some plots, I realized I’m still unsure on how to count them. Specifically I have this prediction/ground truth:

I was thinking it would count as, Dog: 1TP, 1FP and Sofa: 1TP. The thing that is annoying me is that the green and purple box are a hit to both ground truth objects with an IoU > 0.5, so if you use the definition some people give (one hit with the wrong class), they should each count once as FP…

daveluo · April 2, 2018, 11:16pm

Thanks for the notebook, will look through it.

Re: your question, I don’t quite understand what you mean by:

The larger dog prediction box (teal color, confidence of 0.38) should only count as 1 FP in the dog category and not be counted at all in the sofa category. This is what I’m understanding your first statement (“I was thinking it would count as, Dog: 1TP, 1FP and Sofa: 1TP”) to mean and I agree with that.

Maybe you have an error in keeping the classes separate when doing your TP/FP counts? Or do you mean that some variant of the mAP metric is supposed to count detections across different classes as FPs?

sgugger · April 2, 2018, 11:30pm

No, no, I keep the TP/FP counts separate by class. It’s just that I’ve read some people present the counting algorithm in terms of hits on ground truth objects, but I think they might sort the hits by decreasing overlap, so the big dog and big sofa boxes are matched with the ground truth sofa (bigger IoU) but not the ground truth dog.

sgugger · April 3, 2018, 2:58pm

So it’s a bit clearer with a (short) night of sleep. In the end I let myself be confused by the matching process (how to attribute a prediction to a ground truth object) which is kind of the same as in the loss function.

I’ve updated the notebook to fix some bugs, make it a bit more efficient, and add a lot of visualization of TP/FP/FN in the hope it will help people who are as confused as I was. I’d be curious to see if my final mAP is correct or not, so any feedback is more than welcome!

daveluo · April 3, 2018, 8:36pm

Hey, thanks again for sharing and updating your notebook! Great stuff and useful learning for me, especially on making things readable and more computationally efficient.

I’ve been (very slowly) coding my way to a fuller understanding of TP, FP, and FN scoring, adjusting the model and IoU thresholds, all the way to mAP (correctly I hope).

For reproducibility and consistency across our work, I copied the first half of your notebook and loaded your trained weights (thanks for that!) to start off at the same point. I hope by the end, I’ll be able to tell you that our mAPs match up (or at least why they don’t).

For now, I wrote some new visualization/educational functions to count every TP, FP, FN, and ground truth object for every class category in an image so I could play around with the model sensitivity and IoU thresholds. Along with some commentary, it looks like this:

Here is my adapted nb: https://gist.github.com/daveluo/2ab83da32e623864e543d7251e9beef4

Hope it’s helpful. I’m moving steadily but slowly on implementation so will continue to comment as I go!

sgugger · April 3, 2018, 9:07pm

Thanks for your comments! We already disagree on one situation, so I’m not sure we’ll find the same results. In this image:

I’m counting
sofa : 1TP
dog: 1FP, 1FN
because in my code the predicted dog box is matched to the ground truth sofa, since that’s the object it overlaps the most, but that might be a mistake.

sgugger · April 3, 2018, 11:36pm

Aaaaah! Got it!
I forgot we have to detect TP,FP and FN for each class separately! So you’re right and I’m wrong on this one. Will try to correct my functions tonight or tomorrow.

daveluo · April 3, 2018, 11:54pm

Thanks for the follow-up! I was scratching my head too and checking if I misunderstood something somewhere (among the incredibly confusing amount of materials about mAP…)

To lay out my current understanding using the dog and sofa example in the image:

For class ‘dog’:

calc jaccard(all ‘dog’ predictions, all ‘dog’ gt_bboxes)
for all ‘dog’ predictions where IoU > 0.5 (or a different iou_thres), take the highest IoU box and call that TP (more generally: for n ‘dog’ ground truth objects, count n-highest IoU ‘dog’ predictions as TPs)
count all other ‘dog’ predictions as FP. If all predictions for the class are <= IoU threshold, then they’re all counted as FPs? This last part I’m not too sure on yet…

Repeat for class ‘sofa’ and any other class of interest.
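The per-class steps above can be sketched roughly as follows, for one class in one image. The names `iou` and `count_tp_fp_fn`, the `(x1, y1, x2, y2)` box format, and the greedy highest-IoU-first matching are assumptions made for illustration. Note that the official PASCAL VOC evaluation instead walks through detections in decreasing confidence order and lets each ground truth object be claimed only once, so the counts can differ in edge cases:

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(pred_boxes, gt_boxes, iou_thres=0.5):
    """Count TP, FP, FN for ONE class in ONE image."""
    # Every (IoU, prediction index, ground truth index) pair above threshold.
    pairs = [
        (iou(p, g), i, j)
        for i, p in enumerate(pred_boxes)
        for j, g in enumerate(gt_boxes)
    ]
    pairs = [t for t in pairs if t[0] > iou_thres]
    pairs.sort(reverse=True)  # best overlaps first

    matched_preds, matched_gts = set(), set()
    for _, i, j in pairs:
        # Each prediction and each ground truth box is used at most once.
        if i not in matched_preds and j not in matched_gts:
            matched_preds.add(i)
            matched_gts.add(j)

    tp = len(matched_preds)
    fp = len(pred_boxes) - tp  # unmatched predictions, incl. all below threshold
    fn = len(gt_boxes) - tp    # ground truth objects nobody matched
    return tp, fp, fn


# One ground truth 'dog', one tight prediction and one oversized prediction.
dog_gt = [(10, 10, 50, 50)]
dog_preds = [(12, 12, 50, 50), (0, 0, 100, 100)]
print(count_tp_fp_fn(dog_preds, dog_gt))
```

With this sketch, a class that has predictions but no ground truth yields only FPs, and a class with ground truth but no predictions yields only FNs.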

For mAP more generally, we loop through all classes one at a time. Even if there are no ground truth objects in a class, we still want to catch all false positives, since a prediction could be made for a class outside of the ground truth (e.g. a prediction for ‘car’ in our example image would be a FP in the ‘car’ class). We also want to catch all false negatives in every class (e.g. if a ‘sofa’ gt object has no predicted boxes, that would be a FN in the ‘sofa’ class).

Does that sound right to you?

Btw, I’ve created a P-R curve for category 14 (‘person’) using a range of model thresholds = np.linspace(.05, 0.95, 40, endpoint=True), which I believe is the same as in your nb.

Shape looks pretty similar despite the difference of TP/FP/FN counting. I guess because across the whole image set, it’s not a significant difference in the P or R calcs:

pr_cat14

The AP of cat 14 in this P-R curve (using your avg_prec func) comes out to be 0.286.
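A simplified sketch of building such a P-R curve by sweeping the model threshold. It assumes each detection for the class already carries a fixed TP/FP flag, which ignores the fact that NMS and matching can change with the threshold. The function `pr_curve` and the toy scores are hypothetical, and precision is set to 0 when nothing is predicted, which is only one of several conventions:

```python
import numpy as np


def pr_curve(scores, is_tp, n_gt, thresholds):
    """Precision and recall for one class at each confidence threshold.

    scores: confidence of every detection of this class over the dataset
    is_tp:  whether each detection was matched to a ground truth object
    n_gt:   number of ground truth objects of this class (must be > 0)
    """
    scores = np.asarray(scores, dtype=float)
    is_tp = np.asarray(is_tp, dtype=bool)
    precisions, recalls = [], []
    for t in thresholds:
        keep = scores > t
        tp = int(np.sum(is_tp & keep))
        fp = int(np.sum(~is_tp & keep))
        precisions.append(tp / (tp + fp) if (tp + fp) > 0 else 0.0)
        recalls.append(tp / n_gt)
    return np.array(precisions), np.array(recalls)


thresholds = np.linspace(0.05, 0.95, 40, endpoint=True)

# Toy detections, for illustration only.
scores = [0.9, 0.8, 0.6, 0.3, 0.1]
is_tp = [True, False, True, True, False]

precisions, recalls = pr_curve(scores, is_tp, n_gt=4, thresholds=thresholds)
print(precisions[0], recalls[0])
```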

sgugger · April 4, 2018, 2:03am

Yes, it’s all clearer to me now, thanks for re-explaining it because it sure needs the details.
I’ve finished rewriting the functions to (hopefully) have the same results as you in terms of TP/FP/FN and updated it on github. In the end, I get to 0.2863 for the AP of class 14 and it changes the whole mAP from 30.7% (bugged version) to 30.99% (new version) so it was indeed small.

Hope this last version is the good one!

daveluo · April 4, 2018, 4:48am

Awesome! I’ve also finished running the full 40-step range of model thresholds on all 20 categories. Ended up with a mAP of 30.17%. I guess the full description of the metric result should be “mAP@0.5 of 20 classes using 11-point interpolated average precision” to be exact.
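For reference, a minimal sketch of 11-point interpolated average precision as used in PASCAL VOC 2007: at each recall level 0.0, 0.1, ..., 1.0, take the maximum precision among points whose recall is at least that level (0 if there are none), then average the 11 values. The function name `eleven_point_ap` and the toy curve are assumptions, and this is not necessarily identical to the avg_prec function in the notebooks:

```python
import numpy as np


def eleven_point_ap(precisions, recalls) -> float:
    """11-point interpolated AP from matching precision/recall arrays."""
    precisions = np.asarray(precisions, dtype=float)
    recalls = np.asarray(recalls, dtype=float)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        # Interpolated precision: best precision at recall >= r, else 0.
        p_interp = precisions[mask].max() if mask.any() else 0.0
        ap += p_interp / 11
    return ap


# Toy curve: precision 1.0 up to recall 0.45, then 0.5 at full recall.
ap = eleven_point_ap(precisions=[1.0, 0.5], recalls=[0.45, 1.0])
print(round(ap, 4))
```

In the toy curve, the five recall levels 0.0 to 0.4 get precision 1.0 and the six levels 0.5 to 1.0 get precision 0.5, so the AP is 8/11.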

I’ve updated my gist as well. The code is painfully inefficient, needs refactoring, and I want to double check to see why we have a 0.8% diff in our mAPs.

Here are my P-R curves per class (with APs on top). If you could generate something similar, comparing them will probably be the quickest way to spot any major class-specific discrepancies between our methods:

Update: I had a list of mAP benchmarks here but I don’t think they’re up to date (mAP ranged from 22.7-43.7 and newest paper was from 2013). Here’s the performance table from the SSD paper:

It’s not immediately clear to me what the exact mAP performance criteria they use is…my homework for tomorrow morning

radek · April 4, 2018, 1:27pm

Great discussion in this thread!

Maybe comparing results to the MultiBox paper might be a better idea.

Not sure how to best verify the correctness of the implementations of mAP. Maybe a pretrained model for one of the archs exists and we could run it to output predictions and calculate the mAP?

Other than that probably implementing quite faithfully one of the simpler models and comparing the results might be the best proxy.

Will most likely get around to doing such a thing myself, but given the pace at which I am going, someone is very likely to beat me to it.

Anyhow, having a working implementation of mAP that we can trust would be quite cool!

sgugger · April 4, 2018, 2:07pm

Here are the figs with my version of the code

The difference between our scores (and figs) is the way we treat the zeros in TP+FP or TP+FN. I chose to put one when we have 0/0 to get nicer curves; I’m not sure what the right way to handle it is.

If I take the same function as you do for the precision and recall (by adding 1e-15 to the denominator) I get 30.17% as well.
I don’t think they took this definition of mAP in the SSD paper. Ours is the one that was used in the 2007 competition, and I think the formula changed in the evaluation for 2012. I didn’t check, but it might also be different for the COCO dataset/competitions.

daveluo · April 4, 2018, 5:03pm

Mystery solved, and quickly too! Thanks for checking and making the figs.

I added an epsilon of 1e-15 to avoid a ZeroDivisionError. Looking at the public implementations of mAP, one of them sets precision=0 if there’s a division error and two of them (including the official cocoeval.py) use the smallest possible epsilon (np.spacing(1) = 2.220446049250313e-16), either as an addition to TP+FP or as max(TP+FP, eps).

I’ll change my treatment of division errors to that last approach (max(TP+FP, eps)) which seems most appropriate and remove eps from the recall calculation where it’s unnecessary (assuming there’s always at least 1 gt object present of every class across the full test set).
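The conventions discussed above for the 0/0 case can be compared side by side. The function names are hypothetical and only illustrate each approach (precision set to 0, precision set to 1 for nicer curves, an additive 1e-15, and max(TP+FP, eps)):

```python
import numpy as np

EPS = np.spacing(1)  # 2.220446049250313e-16


def precision_zero_on_empty(tp, fp):
    """Treat 0/0 as precision 0."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def precision_one_on_empty(tp, fp):
    """Treat 0/0 as precision 1 (gives smoother-looking P-R curves)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 1.0


def precision_add_eps(tp, fp, eps=1e-15):
    """Add a tiny epsilon to the denominator; slightly perturbs every value."""
    return tp / (tp + fp + eps)


def precision_max_eps(tp, fp, eps=EPS):
    """Clamp the denominator from below; exact whenever TP + FP > 0."""
    return tp / max(tp + fp, eps)


for fn in (precision_zero_on_empty, precision_one_on_empty,
           precision_add_eps, precision_max_eps):
    print(fn.__name__, fn(0, 0), fn(1, 2))
```

Only the "one on empty" variant differs at 0/0, which matches the small gap between the two mAP figures reported in this thread.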

This change won’t do much for our mAP though. I’ll do some manual checking, particularly within the poorer performing classes like ‘sheep’ and ‘boat’, and see if there are any steps amiss between individual image TP,FP,FN scoring and P-R curve construction.

I also came across this paper, “Diagnosing Error in Object Detectors”, which is frequently cited. It does confirm that we are counting our TPs and FPs correctly with regards to the IoU threshold >0.5.

Informative further break-out of the types of FP errors:

One major type of error is false positives, detections that do not correspond to the target
category. There are different types of false positives which likely require different
kinds of solutions. Localization error occurs when an object from the target category
is detected with a misaligned bounding box (0.1 <= overlap < 0.5). Other overlap
thresholds (e.g., 0.2 <= overlap < 0.5) led to similar conclusions. “Overlap” is defined
as the intersection divided by union of the ground truth and detection bounding
boxes. We also consider a duplicate detection (two detections for one object) to be localization
error because such mistakes are avoidable with good localization. Remaining
false positives that have at least 0.1 overlap with an object from a similar category are
counted as confusion with similar objects. For example a “dog” detector may assign
a high score to a “cat” region. We consider two categories to be semantically similar
if they are both within one of these sets: {all vehicles}, {all animals including person},
{chair, diningtable, sofa}, {aeroplane, bird}. Confusion with dissimilar objects
describes remaining false positives that have at least 0.1 overlap with another labeled
VOC object. For example, the FGMR bottle detector very frequently detects people
because the exterior contours are similar. All other false positives are categorized as
confusion with background. These could be detections within highly textured areas
or confusions with unlabeled objects that are not within the VOC categories.

daveluo · April 4, 2018, 10:23pm

I’ve taken a tour through the lowest scoring AP classes (all < 0.100) : ‘sheep’, ‘bottle’, ‘boat’

Doing so leads me to believe our mAP calculations are essentially correct (with minor quirks like handling division by zero differently) and a roughly accurate reflection of our model performance. Taking a look at the worst performing classes also helped me ID a few areas of improvement where tweaking our model architecture and training parameters should help increase the mAP score.

ISSUE 1:
Our model struggles with very small, numerous gt objects like sheep. For example, let’s look at the TP, FP, and FN count of ‘sheep’ at different model threshold levels (the ‘model_thres’ value defined in NMS: c_mask = conf_scores[cl] > model_thres) with visualized prediction bboxes versus gt bboxes:

NMS CONFIDENCE THRESHOLD: 0.15
  2 FP 'bird'
  1 FP 'cow'
  0 TP, 19 FP, 3 FN, 3 actual 'sheep'

NMS CONFIDENCE THRESHOLD: 0.25
  0 TP, 5 FP, 3 FN, 3 actual 'sheep'

NMS CONFIDENCE THRESHOLD: 0.35
  3 FN 'sheep'

Across threshold levels, we generate a different number of False Positives as expected but we never get any TP hits on ‘sheep’ because the prediction box IoUs are always < 0.5. At the lower end, we have many FPs instead which reduce our Precision score and at the higher end, we stop making predictions and end up with all FNs.

Here are some other examples of this where the gt objects are very small:

NMS CONFIDENCE THRESHOLD: 0.15
  4 FP 'bird'
  1 FP 'cat'
  0 TP, 1 FP, 1 FN, 1 actual 'sheep'

NMS CONFIDENCE THRESHOLD: 0.15
  2 FP 'bird'
  2 FN 'sheep'

We have a similar problem with ‘bottle’ where they’re usually very small and/or numerous:

NMS CONFIDENCE THRESHOLD: 0.15
  0 TP, 1 FP, 2 FN, 2 actual 'bottle'
  2 TP, 13 FP, 3 FN, 5 actual 'person'

NMS CONFIDENCE THRESHOLD: 0.25
  1 FN 'bottle'
  2 FP 'chair'
  1 TP, 2 FP, 3 FN, 4 actual 'person'
  1 FN 'pottedplant'
  0 TP, 1 FP, 1 FN, 1 actual 'sofa'

ISSUE 2:
The model also struggles with narrow (very tall or wide) rectangles and/or scales that range from one extreme to another. Basically objects that aren’t as square-ish or of consistent scale…i.e. tall bottles or tall boats that take up different proportions of the image frame:

NMS CONFIDENCE THRESHOLD: 0.20
  3 TP, 4 FP, 1 FN, 4 actual 'bottle'

NMS CONFIDENCE THRESHOLD: 0.15
  1 TP, 8 FP, 0 FN, 1 actual 'boat'

NMS CONFIDENCE THRESHOLD: 0.25
  1 TP, 2 FP, 1 FN, 2 actual 'boat'
  1 FN 'person'

In these situations, it looks like we usually get our TPs but also many FPs that fail to be suppressed.

ISSUE 3:
Our classes are not balanced in training (or evaluation). Some classes show up less often than others, so our model doesn’t have as many examples to train on. The diversity of how objects are presented within images (small flocks of sheep in the corner vs one large main subject sheep) also affects the generalizability of training. This alone doesn’t explain performance but it should be a contributing factor. I’ve highlighted the number of ‘sheep’, ‘boat’, ‘bottle’ objects in our dataset:

from http://host.robots.ox.ac.uk/pascal/VOC/voc2007/workshop/everingham_cls.pdf

Here are some ideas to try which should help improve model mAP:

Use more grid cells and more zooms to create smaller anchor boxes. Something like 7x7, 4x4, 2x2, 1x1 to capture small objects like sheep and bottles.
Increase the aspect ratio parameters to have even more narrow rectangles to catch ships/bottles/people.
Try stratifying train and val datasets (like what StratifiedKFold does in scikit-learn). And/or upsample underrepresented classes to get more class-balanced.
Tweak Non Max Suppression settings to adjust how many FPs remain.
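To make the first idea concrete, here is a rough sketch of how many square anchor cells a 7x7, 4x4, 2x2, 1x1 set of grids would give and how small the smallest cell would be. The function `grid_anchors` and the `(cx, cy, w, h)` layout in normalized image coordinates are assumptions for illustration. It leaves out the zooms and aspect ratios, which would multiply the number of boxes per cell:

```python
import numpy as np


def grid_anchors(grid_sizes=(7, 4, 2, 1)):
    """Square anchors (cx, cy, w, h) in [0, 1] coordinates, one per grid cell."""
    anchors = []
    for k in grid_sizes:
        centers = (np.arange(k) + 0.5) / k      # cell centres along one axis
        cx, cy = np.meshgrid(centers, centers)  # all k*k centre combinations
        size = np.full(k * k, 1.0 / k)          # each cell spans 1/k of the image
        anchors.append(np.stack([cx.ravel(), cy.ravel(), size, size], axis=1))
    return np.concatenate(anchors)


anchors = grid_anchors()
print(anchors.shape)        # 49 + 16 + 4 + 1 cells, 4 values each
print(anchors[:, 2].min())  # side of the smallest anchor: 1/7 of the image
```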