Having trouble understanding some of the details of R-CNN (the original one)

Here is what I understand (or at least what I think I understand).

We first train our model on our images using transfer learning.

So now we have a pretrained model.

For each image in our dataset, we run selective search on it, which produces 2000 region proposals. These 2000 region proposals are fed through our pre-trained NN.

However, we only collect the output (feature maps) from the last convolution layer. These outputs are saved to a hard disk.

These feature maps are fed into an SVM for another round of training, where another label, “no object”, is added.
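To make the labeling step concrete, here is a minimal sketch of one way a proposal could be given a class label or "no object" automatically, based on its overlap (IoU) with the annotated boxes. The 0.5 threshold, the box format `(x1, y1, x2, y2)`, and the names `iou` and `label_proposal` are my assumptions for illustration, not necessarily what R-CNN does.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, annotations, threshold=0.5):
    """Return the class of the best-overlapping annotated box, or "no object".

    annotations is a list of (box, class_name) pairs.
    threshold is an assumed value chosen only for this example.
    """
    best_label, best_iou = "no object", 0.0
    for box, name in annotations:
        overlap = iou(proposal, box)
        if overlap >= threshold and overlap > best_iou:
            best_label, best_iou = name, overlap
    return best_label


# One hand-annotated box, two hypothetical region proposals.
annotations = [((10, 10, 110, 110), "cat")]
print(label_proposal((20, 20, 120, 120), annotations))    # overlaps the cat box
print(label_proposal((200, 200, 260, 260), annotations))  # overlaps nothing
```

With a rule like this, only the ground-truth boxes would need to be annotated by hand, and each proposal would inherit its label from them.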

We also have a regression model that is trained on the window coordinates that we annotated.
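As far as I can tell, the regression model does not predict raw coordinates but offsets from a proposal box to the annotated box. Below is a minimal sketch of that parameterization as I understand it: center offsets scaled by the proposal size, plus log ratios of width and height. The helper names (`to_center`, `regression_targets`, `apply_offsets`) and the `(x1, y1, x2, y2)` box format are my own assumptions.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2.0, y1 + h / 2.0, w, h


def regression_targets(proposal, ground_truth):
    """Offsets (tx, ty, tw, th) that map the proposal onto the ground truth."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(ground_truth)
    return (
        (gx - px) / pw,       # horizontal shift, in units of proposal width
        (gy - py) / ph,       # vertical shift, in units of proposal height
        math.log(gw / pw),    # log scale change in width
        math.log(gh / ph),    # log scale change in height
    )


def apply_offsets(proposal, t):
    """Invert regression_targets: move and rescale the proposal by t."""
    px, py, pw, ph = to_center(proposal)
    cx, cy = px + pw * t[0], py + ph * t[1]
    w, h = pw * math.exp(t[2]), ph * math.exp(t[3])
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


# Hypothetical proposal and annotated box of the same size, shifted by 10 px.
proposal = (20, 20, 120, 120)
ground_truth = (10, 10, 110, 110)
targets = regression_targets(proposal, ground_truth)
print(targets)
print(apply_offsets(proposal, targets))
```

The regressor would be trained to predict `targets` from the features of the proposal, and `apply_offsets` would then refine the proposal at test time.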

So we have an SVM and a regression model (two models) that we train.

1) Is the above correct?

2) Is each of these 2000 region proposals hand-labeled (with the correct label (cat, dog, etc.) or no-object) before being fed into the SVM?

3) Is the regression model tied into the SVM model? Basically, is our loss a combination of both the regression coordinates and the SVM classification?

Thank you.