Unclear process on lesson 8 (Object Detection)

At the end of lesson 8, Jeremy started building the object detector. He began with a basic image classifier, which I understood completely.
But after that, he built a model for drawing the box. How does feeding the image and the coordinates into a neural network let it predict where the box should be in different images? It was unclear to me how the network could predict that from just those two pieces of information. I also didn't clearly understand the two steps he took to create this new model. If someone could explain them to me, I would be grateful!

Thanks!