Hi @mimi, what I understood from the paper is that once you have the convolutional feature map, a sliding window is passed over it. At each sliding-window position you get k = 9 different anchors. So if the feature map is 14x14x512, you get up to 14x14x9 = 1764 anchors in total. From all of these anchors they randomly sample positive and negative ones to make up the mini-batch. So in this process they use only one network for all anchors.
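To make the counting concrete, here is a rough sketch of the anchor generation and sampling as I read it, not the authors' actual code. The function names `generate_anchors` and `sample_minibatch` are made up for illustration, and the defaults (stride 16, scales 128/256/512, ratios 0.5/1/2, 256 anchors per mini-batch with at most half positive) are my assumptions based on the settings described in the paper. How the positive/negative labels are assigned (by IoU with the ground-truth boxes) is left out here.

```python
import random


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors: len(scales) * len(ratios) per location."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell in input-image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # ratio is height / width; the area stays scale ** 2.
                    w = scale / ratio ** 0.5
                    h = scale * ratio ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


def sample_minibatch(labels, batch_size=256, seed=0):
    """Sample anchor indices; labels: 1 = positive, 0 = negative, -1 = ignored."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    # At most half the batch is positive; negatives fill the rest.
    n_pos = min(len(pos), batch_size // 2)
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)


anchors = generate_anchors(14, 14)
print(len(anchors))  # 14 * 14 * 9 anchors for a 14x14 feature map
```

The point is that every anchor comes from the same feature map and the same small network; the sampling only decides which anchors contribute to the loss for a given image.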