I am pretty new to object detection, I am trying to implement SSD paper and I am really confused with the concept of default boxes. I am pretty sure it is used in the loss function but it’s really getting hard for me to grasp the concept of default boxes and how they can be implemented in the actual model. These are the hypothesis that I have come up with:
After adding additional convolution layers, each of these layers will produce feature maps. These feature maps are consists of cells and each of these cells we are going to bind it with set of default boxes of different aspect ratios. At each of these cells, we are going to predict offest values relative to default box, as well as per class scores. Therefore, we will have (c+4) prediction to be made for each default box. So according to this hypothesis,
How can we add these default boxes to each of these feature map cells?
How are we going to predict offset values with respect to the default boxes?
I understood this hypothesis after reading the above paper. The paper says, that we are going to match our bounding boxes(ground-truth) with default boxes using IOU metric greater than 0.5. So accordingly this will reduce the number of default boxes since we are choosing the best default box which is matching to the ground-truth bounding box. Is this assumption correct?
A default box (also known as anchor box) is just an initial guess for what the bounding box is likely to be. What the network actually predicts is a refinement of this guess.
No, the default boxes are chosen up-front by the designer of the model. The CNN predicts how the boxes are modified to become the final predicted bounding boxes.
Hi, I’m quite confused on how SSD will predict anchor boxes after training? While training, the predicted boxes will be compared to the ground truth to find the biggest IoU, but after training is done and we use to predict some image, how does it works (there is no ground truth anymore)? Does the default boxes will be replaced with the dimension that it get while training? or the default boxes will be the same as the default boxes on training?
The network predicts changes to each anchor box. In other words, for each anchor box it predicts an offset for its x/y position and a scaling factor for its width and height.
When there is an object in the image, the detector chooses the anchor box that is closest to that object and has roughly the same size of the object. So the unmodified anchor box should already give a pretty good detection for that object.
The detector could just say, “hey this anchor box is the best match for the object” and leave it at that. But even though that anchor box will be the closest possible match from all the available anchor boxes, it’s still not an exact match. That’s why the model also predicts how the anchor box should be tweaked (x/y offset, w/h scale) to fit the object better.