Part 2 Lesson 9 wiki

chs820 · March 27, 2018, 3:22am

If there are two overlapping bounding boxes within an anchor box, can the model only predict one of the classes?

snagpaul · March 27, 2018, 3:22am

Depends on how we map the ground truth to represent the data. I’m thinking that would have 16 boxes with 16 categories in the example here.

rachel · March 27, 2018, 3:22am

The anchor boxes are used for calculating the loss function. Also, the predicted box is the anchor box plus activations.

narvind2003 · March 27, 2018, 3:24am

No, YOLO is also multi-class prediction.
In YOLO you have a flattened dense layer with no geometry. In SSD you predict directly from the last conv2d layer. (Well, as he said, the YOLOv3 paper was released this morning and it switches to the SSD method.)

memetzgz · March 27, 2018, 3:24am

def actn_to_bb(actn, anchors):
    actn_bbs = torch.tanh(actn)
    actn_centers = (actn_bbs[:,:2]/2 * grid_sizes) + anchors[:,:2]
    actn_hw = (actn_bbs[:,2:]/2+1) * anchors[:,2:]
    return hw2corners(actn_centers, actn_hw)

This is where I got the idea that the anchor boxes are the basis for the (predicted) bounding boxes . . .

okay, Rachel addressed this above in her reply to Ducky . . .

rachel · March 27, 2018, 3:25am

Your predicted boxes are anchor boxes plus activations. Since you are adding activations (which are learned), the predicted boxes can have “moved” from the location of the anchor boxes.
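
To make "anchor boxes plus activations" concrete, here is a minimal sketch of the same arithmetic as actn_to_bb, written for a single box in plain Python instead of PyTorch so it runs without any dependencies. The name actn_to_bb_single, the list-based hw2corners, and the example anchor values are assumptions for illustration, not the notebook's code.

```python
import math

def hw2corners(ctr, hw):
    # (center, size) -> [min0, min1, max0, max1]
    return [ctr[0] - hw[0] / 2, ctr[1] - hw[1] / 2,
            ctr[0] + hw[0] / 2, ctr[1] + hw[1] / 2]

def actn_to_bb_single(actn, anchor, grid_size):
    """actn: 4 raw activations; anchor: [c0, c1, h, w] in 0..1 image units."""
    t = [math.tanh(a) for a in actn]           # squash each activation into (-1, 1)
    ctr = [t[0] / 2 * grid_size + anchor[0],   # center shifts by at most half a cell
           t[1] / 2 * grid_size + anchor[1]]
    hw = [(t[2] / 2 + 1) * anchor[2],          # size scales between 0.5x and 1.5x
          (t[3] / 2 + 1) * anchor[3]]
    return hw2corners(ctr, hw)

# Hypothetical anchor: the top-left cell of a 4x4 grid (cell size 0.25).
anchor = [0.125, 0.125, 0.25, 0.25]
zero = actn_to_bb_single([0, 0, 0, 0], anchor, 0.25)      # no movement at all
big = actn_to_bb_single([10, 10, 10, 10], anchor, 0.25)   # close to the maximum shift and scale
print(zero)
print(big)
```

With all-zero activations the predicted box is exactly the anchor box. With very large activations the center moves by almost half a cell and the size grows to almost 1.5x, which is as far as this parameterization lets a box "move".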

bharadwaj · March 27, 2018, 3:25am

How do we decide the sizes and ratios of the anchor boxes to create? Is there an API in fastai to generate them, like we do transformations?

divyansh · March 27, 2018, 3:25am

Are anchor boxes and grid cells the same?

In the deeplearning.ai MOOC they were treated as different. A grid cell was a single piece of the division of the original image into NxN, while the anchor boxes made up the predicted volume, say 4x4x((4+C)xK).
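
As a quick sanity check on that 4x4x((4+C)xK) shape, here is a small sketch that just does the arithmetic. The helper name output_shape is made up for illustration, C=21 is taken from the class count mentioned elsewhere in this thread, and K=9 is only an assumed value; whether C counts a background class depends on the convention used.

```python
def output_shape(grid, n_classes, k):
    """Shape of a grid x grid x ((4 + C) * K) prediction volume."""
    return (grid, grid, (4 + n_classes) * k)

shape_k1 = output_shape(4, 21, 1)   # one anchor box per grid cell
shape_k9 = output_shape(4, 21, 9)   # nine anchor boxes per grid cell (assumed K)

# Total number of predicted boxes is cells * anchors per cell.
n_boxes_k1 = shape_k1[0] * shape_k1[1] * 1
n_boxes_k9 = shape_k9[0] * shape_k9[1] * 9
print(shape_k1, n_boxes_k1)
print(shape_k9, n_boxes_k9)
```

So the grid fixes how many cells there are, and K multiplies the number of boxes predicted in each cell.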

Borz · March 27, 2018, 3:26am

Something I never got: do anchor boxes ever change/transform or is it more of a “let me match my anchor boxes to the grid cell and see which one works the best”?

vikbehal · March 27, 2018, 3:27am

I guess! That’s how it’s working/learning

bhollan · March 27, 2018, 3:27am

They have very limited movement: at most 50% of a grid cell in x and/or y, and no rotation.

But there’s also a host of anchor boxes to give you good ‘reception’ of the actual object.
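
A rough way to see why the extra anchor boxes help, assuming ground truth is matched to anchors by overlap (IoU, also called the Jaccard index) as in the SSD paper linked later in this thread. The function names and the example boxes below are hypothetical, and boxes are given as four corner coordinates in 0..1 image units.

```python
def area(b):
    # Area of a corner-format box; clamps to 0 when the box is empty.
    return max(b[2] - b[0], 0.0) * max(b[3] - b[1], 0.0)

def jaccard(a, b):
    """Intersection over union of two corner-format boxes."""
    inter = [max(a[0], b[0]), max(a[1], b[1]),
             min(a[2], b[2]), min(a[3], b[3])]
    i = area(inter)
    return i / (area(a) + area(b) - i)

# A tall, thin object (think of a standing person).
gt = [0.0, 0.4, 1.0, 0.6]

square_anchor = [0.25, 0.25, 0.75, 0.75]   # 0.5 x 0.5 square in the middle
tall_anchor = [0.0, 0.375, 1.0, 0.625]     # 1.0 tall, 0.25 wide, same center

iou_square = jaccard(gt, square_anchor)
iou_tall = jaccard(gt, tall_anchor)
print(iou_square, iou_tall)
```

The square anchor overlaps the tall object poorly (IoU around 0.29), while the tall anchor overlaps it well (IoU around 0.8). Since each anchor can only shift and scale a limited amount, having anchors of several sizes and aspect ratios means some anchor starts out close to most objects.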

Borz · March 27, 2018, 3:27am

Anchor boxes are used within each grid cell. The grid cell just splits up the image; the anchors are used for the loss calculation.

KevinB · March 27, 2018, 3:27am

So the activations are saying “take the anchor box and make it smaller (or larger) and adjust the x coordinate by X and the y coordinate by Y in order to get your predicted box”

Is that correct?

AmanDaVinci · March 27, 2018, 3:27am

I’m not familiar with YOLO. Can you tell me the shape of the prediction outputs and how it helps with multi-class classification? I’m assuming there is only 1 bbox and 21 classes. Correct me if I’m wrong.

binga · March 27, 2018, 3:27am

It’s in the “Set up model” section: see anc_offset, anc_ctrs, and anc_sizes.
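
For anyone who wants to see what those variables could look like, here is a minimal dependency-free sketch for a 4x4 grid with one square anchor per cell. anc_offset, anc_ctrs and anc_sizes are the names mentioned above; anc_grid, ticks and anchors are names assumed here for illustration, and the lesson's own code may build these differently.

```python
anc_grid = 4                       # assumed: a 4x4 grid of cells
anc_offset = 1 / (anc_grid * 2)    # center of the first cell: 0.125

# Evenly spaced cell centers along one axis: 0.125, 0.375, 0.625, 0.875
ticks = [anc_offset + i / anc_grid for i in range(anc_grid)]

# One center per grid cell, and one square size per center.
anc_ctrs = [(a, b) for a in ticks for b in ticks]
anc_sizes = [(1 / anc_grid, 1 / anc_grid)] * len(anc_ctrs)

# Each anchor is (center0, center1, height, width) in 0..1 image units.
anchors = [c + s for c, s in zip(anc_ctrs, anc_sizes)]
print(len(anchors), anchors[0], anchors[-1])
```

Adding more sizes or aspect ratios just means generating more (height, width) pairs per center, so the number of anchors grows while the grid stays the same.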

bhollan · March 27, 2018, 3:29am

“Don’t worry if it’s a bit complicated at first”

Phew!

matttrent · March 27, 2018, 3:32am

I understand why we have a 4x4 grid of receptive fields (with 1 anchor box each) to coarsely localize objects in the image. In this case, every ground truth bbox has an anchor box associated with it, but not every anchor box has a bounding box associated with it.

What I think I’m missing is why we need multiple receptive fields at different sizes, each with multiple anchor boxes of differing ratios associated with them.

The first version already included 16 receptive fields, each with a single anchor box associated with them. With the additions, there are now many more anchor boxes to consider. Why are those additional anchor boxes necessary if we already had 16 anchor boxes to correspond with each of the possible 16 objects to detect?

Is this because you constrained how much a receptive field could move or scale from its original size? Or is there another reason?

aza · March 27, 2018, 3:32am

Could we take a different approach here that doesn’t require us to manually code a loss function with this complex logic of anchor boxes?

E.g., take an adversarial approach, where we have a second net that tries to guess whether a set of bounding boxes came from our ConvNet or from the ground truth. A method like this is used for aligning word vector spaces: https://github.com/facebookresearch/MUSE

mandroid6 · March 27, 2018, 3:33am

Even I am confused regarding this!

bhollan · March 27, 2018, 3:37am

There’s your reading list:

https://arxiv.org/abs/1506.01497
https://arxiv.org/abs/1506.02640
https://arxiv.org/abs/1512.02325
https://arxiv.org/abs/1708.02002