In the paper SSD: Single Shot MultiBox Detector, the authors regress offsets for the cx, cy, w, h of the default bounding box.

Why not simply use g instead of g_hat? Why can't we just minimize the difference between l and g?

If I remember right, it’s mainly because smooth-L1 is less sensitive to outliers.

Some more details: if you just take the raw difference, some values will be negative and the errors can cancel out in the average. If you take the absolute value of the differences (L1 loss), there is no gradient defined at 0, which doesn't play well with SGD. If you square the differences (L2 loss) so there are no negatives, then your loss values can get very large and you'll be forced to use a very small learning rate.

Smooth-L1 is the best of both worlds. It behaves like L2 near zero so you have a gradient, but like L1 away from zero so your loss doesn’t explode either. Smooth-L1 has been used since R-CNN days (probably earlier?) and works well for this task.
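To make that concrete, here's a minimal sketch of the piecewise smooth-L1 (Huber with beta = 1) used in the R-CNN family; the function name is mine, not from any library:

```python
# Smooth-L1 with beta=1: 0.5*x**2 if |x| < beta, else |x| - 0.5*beta.
def smooth_l1(x: float, beta: float = 1.0) -> float:
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax ** 2 / beta  # quadratic near zero: well-defined gradient at 0
    return ax - 0.5 * beta           # linear far from zero: loss doesn't explode

# Near zero it matches L2; far away it matches L1 shifted down by 0.5.
for x in (0.1, 0.5, 2.0, 10.0):
    print(x, smooth_l1(x))
```

You can see both regimes: smooth_l1(0.1) = 0.005 (quadratic), while smooth_l1(10.0) = 9.5 (linear), so a single outlier box contributes ~10x its error rather than ~100x as it would under L2.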

About the g_hats, it's because of how they map those values back into actual coordinates. Poke around in this torchvision utility file to see how that happens. If I remember right, the cx and cy transforms make it so the size of your proposal region does not matter (scale-invariant). And the log transforms make it so that your height and width can never be negative (which only becomes clear when you map the offset back into an actual coordinate).
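A hedged sketch of that encoding (the SSD/R-CNN-style g_hat targets), with boxes as (cx, cy, w, h); the function and variable names here are mine, not from any particular library:

```python
import math

def encode(default, gt):
    """Turn a matched ground-truth box into regression targets
    relative to a default (anchor) box."""
    d_cx, d_cy, d_w, d_h = default
    g_cx, g_cy, g_w, g_h = gt
    # Center offsets are divided by the default box size, so the
    # target is relative to the proposal's scale (scale-invariant).
    t_cx = (g_cx - d_cx) / d_w
    t_cy = (g_cy - d_cy) / d_h
    # Log-ratio for width/height: decoded later with exp(), so the
    # predicted width/height can never come out negative.
    t_w = math.log(g_w / d_w)
    t_h = math.log(g_h / d_h)
    return (t_cx, t_cy, t_w, t_h)

print(encode((50.0, 50.0, 20.0, 20.0), (54.0, 48.0, 40.0, 10.0)))
```

Note how the same 4-pixel center shift produces the same t_cx whether the default box is 20 or 200 pixels wide times ten, which is exactly the scale invariance mentioned above.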

I admit that part is hard to grasp and explain. It really helped me to sit with pen and paper one day and plug in some examples. Then it becomes clearer, else it’s just kinda spatially abstract.

Thank you! Now it does make more sense to me why it is so.

However, one has to apply the inverse transformation at inference time to get meaningful bounding boxes.
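Right. A sketch of that inverse transform, assuming the SSD/R-CNN-style encoding discussed above (again, names are mine):

```python
import math

def decode(default, offsets):
    """Map regressed offsets back to an actual (cx, cy, w, h) box."""
    d_cx, d_cy, d_w, d_h = default
    t_cx, t_cy, t_w, t_h = offsets
    cx = t_cx * d_w + d_cx
    cy = t_cy * d_h + d_cy
    w = math.exp(t_w) * d_w  # exp() guarantees w > 0
    h = math.exp(t_h) * d_h  # exp() guarantees h > 0
    return (cx, cy, w, h)

# Decoding a box's own encoded offsets recovers that box
# (up to float rounding): here, roughly (54, 48, 40, 10).
print(decode((50.0, 50.0, 20.0, 20.0),
             (0.2, -0.1, math.log(2.0), math.log(0.5))))
```

This is also where the log transform pays off: whatever value the network regresses for t_w, exp(t_w) is positive, so the decoded box always has a valid width and height.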