I am pretty new to object detection. I am trying to implement the SSD paper, and I am really confused by the concept of default boxes. I am pretty sure they are used in the loss function, but I am finding it hard to grasp what default boxes are and how they can be implemented in the actual model. This is the hypothesis I have come up with:
After the additional convolution layers are added, each of these layers produces a feature map. Each feature map consists of cells, and each cell is bound to a set of default boxes with different aspect ratios. For each default box at each cell, we predict offset values relative to that default box, as well as per-class scores. Therefore, (c+4) predictions are made for each default box, where c is the number of classes. Based on this hypothesis, my questions are:
- How can we add these default boxes to each of these feature map cells?
- How are we going to predict offset values with respect to the default boxes?
- I arrived at this hypothesis after reading the paper above. The paper says that the ground-truth bounding boxes are matched with default boxes whose IoU (Jaccard overlap) with them is greater than 0.5. So I assume this reduces the number of default boxes, since we only keep the default boxes that best match a ground-truth bounding box. Is this assumption correct?
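To make the matching step in my last question concrete, here is a minimal NumPy sketch of how I currently picture it. The function names `iou_matrix` and `match_default_boxes` are my own (hypothetical), boxes are assumed to be in corner form `(xmin, ymin, xmax, ymax)`, and only the 0.5 threshold comes from the paper.

```python
import numpy as np

def iou_matrix(gt, defaults):
    """Pairwise IoU between gt (G, 4) and defaults (D, 4), corner form."""
    # Corners of every pairwise intersection rectangle, shape (G, D, 2).
    top_left = np.maximum(gt[:, None, :2], defaults[None, :, :2])
    bottom_right = np.minimum(gt[:, None, 2:], defaults[None, :, 2:])
    wh = np.clip(bottom_right - top_left, 0.0, None)  # zero if no overlap
    inter = wh[..., 0] * wh[..., 1]
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    area_d = (defaults[:, 2] - defaults[:, 0]) * (defaults[:, 3] - defaults[:, 1])
    union = area_gt[:, None] + area_d[None, :] - inter
    return inter / union

def match_default_boxes(gt, defaults, threshold=0.5):
    """For every default box, return the matched ground-truth index, or -1."""
    iou = iou_matrix(gt, defaults)        # shape (G, D)
    best_gt = iou.argmax(axis=0)          # best ground truth per default box
    best_iou = iou.max(axis=0)
    matches = np.where(best_iou > threshold, best_gt, -1)
    # Each ground truth also claims its single best default box,
    # even if that overlap is below the threshold.
    best_default = iou.argmax(axis=1)
    matches[best_default] = np.arange(len(gt))
    return matches

gt = np.array([[0.1, 0.1, 0.5, 0.5]])
defaults = np.array([
    [0.10, 0.1, 0.5, 0.5],   # identical to the ground truth
    [0.60, 0.6, 0.9, 0.9],   # no overlap
    [0.15, 0.1, 0.5, 0.5],   # large overlap
])
matches = match_default_boxes(gt, defaults)
```

In this sketch the number of default boxes does not change: every default box keeps an entry in `matches`, and the ones without a match are only marked with `-1`. Whether that is what the paper intends is part of what I am asking.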
The black boxes in the image above are the default boxes.
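For the first two questions, this is a minimal NumPy sketch of how I think default boxes could be attached to the cells of one square feature map, and how the offsets relative to a default box could be computed. The function names, the example scale of 0.2, and the three aspect ratios are my own assumptions. The cell-center formula, the width and height formulas, and the offset encoding are how I read the paper. I have left out the extra aspect-ratio-1 box that the paper adds.

```python
import numpy as np

def default_boxes_for_map(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Default boxes (cx, cy, w, h) in [0, 1] image coordinates."""
    boxes = []
    for i in range(fmap_size):            # row of the feature map
        for j in range(fmap_size):        # column of the feature map
            # Each box is centered on its feature map cell.
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * np.sqrt(ar)
                h = scale / np.sqrt(ar)
                boxes.append([cx, cy, w, h])
    return np.array(boxes)

def encode_offsets(gt, default):
    """Regression target for one ground-truth box relative to one default box.

    Both boxes are (cx, cy, w, h).
    """
    return np.array([
        (gt[0] - default[0]) / default[2],   # center x shift, in box widths
        (gt[1] - default[1]) / default[3],   # center y shift, in box heights
        np.log(gt[2] / default[2]),          # log width ratio
        np.log(gt[3] / default[3]),          # log height ratio
    ])

boxes = default_boxes_for_map(fmap_size=3, scale=0.2)
# 3 x 3 cells with 3 default boxes each gives 27 default boxes.
offsets = encode_offsets(np.array([0.2, 0.2, 0.3, 0.3]), boxes[0])
```

If this is right, the default boxes are fixed constants computed once from the feature map size, and the network only outputs the four offsets plus the c class scores for each of them.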