How to train the RPN in Faster R-CNN?

Link to paper

I’m trying to understand the region proposal network in Faster R-CNN. I understand what it’s doing, but I still don’t understand exactly how training works, especially the details.

Let’s assume we’re using VGG16’s last conv layer with shape 14x14x512 (before max pooling, with 224x224 input images) and k = 9 different anchors. At inference time I want to predict, at each sliding-window position, 9·2 = 18 class scores and 9·4 = 36 bounding box coordinates. My intermediate layer is a 512-dimensional vector.
(The figure from the paper shows a 256-d intermediate layer because it depicts the ZF network.)
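To make my setup concrete, here is a minimal sketch of how I picture the RPN head on top of that feature map (PyTorch, and all layer/variable names are my own, not from any official implementation):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal RPN head sketch: a shared 3x3 conv ("sliding window") followed
    by two sibling 1x1 convs for objectness scores and box deltas."""
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        # 3x3 sliding window over the conv feature map
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        # 2 scores (object / not object) per anchor
        self.cls_logits = nn.Conv2d(mid_channels, num_anchors * 2, kernel_size=1)
        # 4 box-regression deltas per anchor
        self.bbox_deltas = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls_logits(x), self.bbox_deltas(x)

# Example: VGG16's conv5_3 feature map for a 224x224 image is 1x512x14x14
feat = torch.randn(1, 512, 14, 14)
scores, deltas = RPNHead()(feat)
print(scores.shape)  # torch.Size([1, 18, 14, 14]) -> 2k scores per position
print(deltas.shape)  # torch.Size([1, 36, 14, 14]) -> 4k coordinates per position
```

So all k anchor types share the 3x3 intermediate layer, and the per-anchor outputs are just separate channels of the two 1x1 convolutions.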

In the paper they write

“we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1”

That’s the part I’m not sure about. Does this mean that for each of the k = 9 anchor types, the corresponding classifier and regressor are trained with mini-batches that only contain positive and negative anchors of that type?

So would I basically be training k different networks with shared weights in the intermediate layer? Each mini-batch would then consist of training data x = the 3x3x512 sliding window of the conv feature map and y = the ground truth for that specific anchor type, and at inference time I would put them all together.
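For reference, the loss that this sampled mini-batch is used to compute is the RPN’s multi-task objective from the paper, with log loss over the two objectness classes and a smooth-L1 regression term that only counts for positive anchors:

```latex
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^{*})
                   + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^{*} \, L_{reg}(t_i, t_i^{*})
```

Here p_i is the predicted objectness probability of anchor i, p_i* is its ground-truth label (1 for positive, 0 for negative anchors), t_i and t_i* are the predicted and ground-truth box parameterizations, and N_cls is the mini-batch size of 256.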

I appreciate your help.

Hi @mimi, what I understood from the paper is that once you have the convolutional feature map, a sliding window is passed over it. For each sliding-window position you get k = 9 different anchors, so for a 14x14x512 feature map you get about 14x14x9 anchors in total. From all of these anchors they randomly sample positive and negative ones to form the mini-batch. So in this process only one network is used for all anchors.
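A rough sketch of what that sampling step could look like in code (the anchor labels are assumed to be precomputed from IoU with the ground-truth boxes; the 256 batch size and 1:1 ratio are the paper’s defaults, and the function name is just my own):

```python
import numpy as np

def sample_rpn_minibatch(anchor_labels, batch_size=256, positive_fraction=0.5):
    """Randomly pick up to `batch_size` anchors with at most a 1:1 pos/neg ratio.

    anchor_labels: shape (num_anchors,), with 1 = positive, 0 = negative,
                   -1 = ignored (e.g. cross-boundary anchors).
    Returns indices of the sampled positive and negative anchors.
    """
    pos_idx = np.flatnonzero(anchor_labels == 1)
    neg_idx = np.flatnonzero(anchor_labels == 0)

    # At most half of the mini-batch is positive; if there are fewer
    # positives, the batch is padded with extra negatives (as in the paper).
    num_pos = min(len(pos_idx), int(batch_size * positive_fraction))
    num_neg = min(len(neg_idx), batch_size - num_pos)

    pos_sample = np.random.choice(pos_idx, size=num_pos, replace=False)
    neg_sample = np.random.choice(neg_idx, size=num_neg, replace=False)
    return pos_sample, neg_sample

# Example: 14 * 14 * 9 = 1764 anchors for the feature map discussed above,
# with made-up labels just to exercise the function.
labels = np.random.choice([1, 0, -1], size=14 * 14 * 9, p=[0.05, 0.55, 0.40])
pos, neg = sample_rpn_minibatch(labels)
print(len(pos), len(neg))  # roughly 90 positives, padded with negatives to 256 total
```

The mini-batch loss is then computed only over these sampled anchors, regardless of which of the 9 anchor types they belong to.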
