I like this statement a lot. I think it can be very helpful to someone learning this:
No problem, I’ll help you out: I’ll just feed you a pre-defined list of anchor boxes you can use. All you have to do is shift them around a little, or maybe scale them a bit, so that they contain whatever is in the vicinity of that box. Oh, and I’ll need you to tell me the class of the object you’ve got in those boxes.
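To make the "shift them around a little, or maybe scale them a bit" part concrete, here is a minimal sketch of how predicted offsets could be applied to one anchor box. It assumes the common center/size parameterization (shift relative to the anchor's size, scale through an exponential). The post does not say which encoding it uses, so `decode_box` and the `(dx, dy, dw, dh)` layout are illustrative names, not the author's code.

```python
import math

def decode_box(anchor, offsets):
    """Apply predicted offsets to one anchor box (hypothetical helper).

    anchor:  (cx, cy, w, h), box center and size
    offsets: (dx, dy, dw, dh), the values the network predicts for this anchor
    """
    a_cx, a_cy, a_w, a_h = anchor
    dx, dy, dw, dh = offsets
    cx = a_cx + dx * a_w        # shift the center, in units of anchor width
    cy = a_cy + dy * a_h        # shift the center, in units of anchor height
    w = a_w * math.exp(dw)      # scale the width; exp keeps it positive
    h = a_h * math.exp(dh)      # scale the height
    return (cx, cy, w, h)

anchor = (0.5, 0.5, 0.2, 0.2)

# Zero offsets leave the anchor exactly where it was.
unchanged = decode_box(anchor, (0.0, 0.0, 0.0, 0.0))

# Move half an anchor-width to the right and double the width.
moved = decode_box(anchor, (0.5, 0.0, math.log(2.0), 0.0))

print(unchanged)
print(moved)
```

The network never has to invent a box from scratch. It only outputs four small numbers per anchor, plus the class scores.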
On the loss function, I am not sure the paragraph is very clear. Overlap information is used only for assigning ground truth boxes to anchor boxes (and therefore to the predictions made for those anchors). That assignment is a discrete matching step, so it is not differentiable and we cannot backpropagate the error through it. The loss itself has two parts: the error in the predicted class (which includes saying an object of some class is assigned to an anchor box when in actuality there is none) and the error in the predicted offsets.
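A rough sketch of that split, in plain Python. The overlap is used only inside `match_anchors`, and the loss only ever sees class probabilities and offsets. This rests on several assumptions of mine: a 0.5 overlap threshold, class 0 meaning background, cross-entropy for the class term, and smooth L1 for the offset term. It also leaves out details a real SSD implementation has, such as forcing every ground truth box to get at least one anchor and hard negative mining. All function names are made up for illustration.

```python
import math

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_anchors(anchors, gt_boxes, threshold=0.5):
    """For each anchor, index of the best-overlapping ground truth box, or -1.

    This is a discrete assignment: no gradient flows through it.
    """
    matches = []
    for a in anchors:
        best_j, best_iou = -1, threshold
        for j, g in enumerate(gt_boxes):
            overlap = iou(a, g)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        matches.append(best_j)
    return matches

def smooth_l1(x):
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def detection_loss(class_probs, pred_offsets, target_offsets, matches, gt_labels):
    """Class error for every anchor, offset error for matched anchors only."""
    cls_loss, loc_loss, n_pos = 0.0, 0.0, 0
    for i, m in enumerate(matches):
        target_cls = gt_labels[m] if m >= 0 else 0   # 0 = background (assumed)
        cls_loss += -math.log(class_probs[i][target_cls])
        if m >= 0:
            n_pos += 1
            loc_loss += sum(smooth_l1(p - t)
                            for p, t in zip(pred_offsets[i], target_offsets[i]))
    return (cls_loss + loc_loss) / max(n_pos, 1)

anchors = [(0, 0, 10, 10), (20, 20, 30, 30)]
gt_boxes = [(1, 1, 10, 10)]
gt_labels = [1]
matches = match_anchors(anchors, gt_boxes)   # second anchor is background

zeros = [(0.0, 0.0, 0.0, 0.0)] * 2
perfect = detection_loss([[0.0, 1.0], [1.0, 0.0]], zeros, zeros, matches, gt_labels)
false_pos = detection_loss([[0.0, 1.0], [0.1, 0.9]], zeros, zeros, matches, gt_labels)
print(matches, perfect, false_pos)
```

The second call shows the "object predicted where there is none" case: the background anchor puts 0.9 on class 1, and the class term penalizes it even though no box is involved.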
As for the last paragraph: it's single shot because there is a single network we send the image through. Other architectures might consist of multiple stages, hence the naming. For instance, we might first have some model that detects regions of interest, another stage that does classification on those regions, another stage to refine predictions, etc.