This and the material from lesson 8 are very difficult! I have viewed the videos multiple times and read the papers, but I still have a couple of questions:
Are predicted bounding boxes and classes limited by the number of anchor boxes? For example, if my architecture has 16 anchor boxes, is the model limited to predicting no more than 16 objects? If not, why not?
Can more than one object be predicted by the same anchor box? For example, assume my image shows a flock of birds perched on a long branch. If the centers of mass of two of the birds fall in the same anchor box, will the system be forced to pick one of the two birds and ignore the other? If that is the case, why? If not, why not?
I was going through this recently, so here are some answers as understood by a beginner:
Yes, the number of (predicted class, predicted bbox) pairs matches the number of anchor boxes. So if you have 16 anchor boxes, you will have 16 sets of (predicted class, predicted bbox). If anchor box 1 contains an object, the activations responsible for that anchor box have to learn how to lower the loss. If anchor boxes 1, 5 and 16 each have an object in them, the activations responsible for those three anchor boxes have to learn to lower their respective losses. When we consistently train them this way, over time those activations will (hopefully) learn to make good predictions for their respective anchor boxes.
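To make the counting concrete, here is a minimal sketch in plain Python. It assumes a 4x4 grid with one anchor box per cell, 20 object classes and one extra "background" slot; the grid size, the class count and the `make_anchors` helper are my own illustration, not taken from the lesson notebooks.

```python
# Hypothetical helper: one anchor box per cell of a grid x grid layout.
# Boxes are (x_min, y_min, x_max, y_max) in coordinates scaled to [0, 1].
def make_anchors(grid=4):
    size = 1.0 / grid
    return [(col * size, row * size, (col + 1) * size, (row + 1) * size)
            for row in range(grid) for col in range(grid)]

anchors = make_anchors(4)
num_classes = 20  # assumed number of object classes (illustrative only)

# The detection head emits one set of class scores and one set of box
# offsets per anchor box, so the output shapes are fixed by the anchor count.
class_scores_shape = (len(anchors), num_classes + 1)  # +1 for "background" (assumed)
bbox_offsets_shape = (len(anchors), 4)

print(len(anchors))        # 16
print(class_scores_shape)  # (16, 21)
print(bbox_offsets_shape)  # (16, 4)
```

Under these assumptions the model always produces 16 predictions per image; anchor boxes that see no object are expected to predict "background".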
No, it seems that each ground-truth object is matched to one anchor box, and the activations responsible for that anchor box “wake up and make a prediction and learn from their mistakes”. However, the different aspect ratios, zoom factors and anchor box sizes, all of which you will learn about in the more painful lesson 9, can help in detecting things in the scenarios you’ve described, such as a little boy standing in front of a big car. Lesson 8 is hard, but wait till you get to Lesson 9.
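Here is a rough sketch of that matching step in plain Python, using the boy-and-car scenario. It only does the "give each ground-truth box its best-overlapping anchor" part; as far as I understand, real implementations typically also assign any anchor whose overlap exceeds a threshold. The box coordinates and the `iou` and `match` helpers are made up for illustration.

```python
def iou(a, b):
    """Intersection over union (Jaccard index) of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match(ground_truths, anchors):
    """Map anchor index -> ground-truth index using best overlap only.
    An anchor holds a single target, so a later object overwrites an earlier one."""
    assigned = {}
    for gt_idx, gt in enumerate(ground_truths):
        best = max(range(len(anchors)), key=lambda i: iou(gt, anchors[i]))
        assigned[best] = gt_idx
    return assigned

# Two objects with the same center: a tall boy (index 0) and a wide car (index 1).
objects = [(0.4, 0.3, 0.6, 0.9), (0.1, 0.4, 0.9, 0.8)]

# Three anchors sharing that center, with different aspect ratios.
square = (0.25, 0.35, 0.75, 0.85)
tall = (0.35, 0.25, 0.65, 0.95)
wide = (0.05, 0.40, 0.95, 0.80)

print(match(objects, [square]))              # {0: 1}
print(match(objects, [square, tall, wide]))  # {1: 0, 2: 1}
```

With only the square anchor, both objects compete for it and the boy is dropped. With the tall and wide anchors added, each object gets its own anchor, which is why the extra aspect ratios help.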
Very good answers by wyquek.
There are some variations depending on the specific detector architecture (mainly one-stage vs. two-stage detectors).
Conceptually, you can think of an anchor box in an object detection architecture as an image crop to which you apply a CNN classifier. Many architectures also include global image features to refine the local anchor box classification.