Hi there @chunduri, good question! Below is my understanding:
For the SConv layers, we can set the channel depth more or less freely based on how many features we want to learn at each scale (I believe Jeremy mentioned he chose 256 to match the SSD paper), not based on the # of predictions.
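For reference, here's a minimal sketch of an SConv-style layer (this is how I remember the StdConv class from the nb; the dropout value and exact ordering are assumptions on my part). The point is that `nout` is a free choice, 256 in the nb:

```python
import torch.nn as nn
import torch.nn.functional as F

class StdConv(nn.Module):
    """Stride-2 conv block: halves the grid; nout (feature depth) is a free choice."""
    def __init__(self, nin, nout, stride=2, drop=0.1):
        super().__init__()
        self.conv = nn.Conv2d(nin, nout, 3, stride=stride, padding=1)
        self.bn = nn.BatchNorm2d(nout)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        return self.drop(self.bn(F.relu(self.conv(x))))
```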
For the OutConv layers, channel depth is determined by the # of predictions at each image region (grid cell). (So for a given grid cell, we can think of that cell's predictions as stacked one on top of the other along the channel dim; there's a concrete sketch of this after the list further down.)
If we were to combine the classification and localization tasks in the same output tensor, you are right that this would imply a channel depth of K*(4+C+1) = 9*(4+20+1) = 225.
However, we use separate convolutional "branches" for the two tasks, and these are implemented as `self.oconv1` and `self.oconv2` within the `OutConv` layer class:

```python
class OutConv(nn.Module):
    def __init__(self, k, nin, bias):
        ...  # rest of __init__ elided
        # classification branch: (C+1) class scores per anchor type
        self.oconv1 = nn.Conv2d(nin, (len(id2cat)+1)*k, 3, padding=1)
        # localization branch: 4 bbox coords per anchor type
        self.oconv2 = nn.Conv2d(nin, 4*k, 3, padding=1)
```
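To sanity-check the resulting depths, here's a standalone snippet (the 256 input channels and the 4x4 grid are just illustrative values standing in for one of the scales):

```python
import torch
import torch.nn as nn

k, n_classes = 9, 20   # 9 anchor box types, 20 Pascal VOC classes
oconv1 = nn.Conv2d(256, (n_classes + 1) * k, 3, padding=1)  # classification branch
oconv2 = nn.Conv2d(256, 4 * k, 3, padding=1)                # localization branch

x = torch.randn(1, 256, 4, 4)  # one image, 256 channels, 4x4 grid
print(oconv1(x).shape)         # torch.Size([1, 189, 4, 4])
print(oconv2(x).shape)         # torch.Size([1, 36, 4, 4])
```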
(In the nb, K is set to 9, not 189: K is the number of default anchor box "types": 3 zooms * 3 aspect ratios = 9 combinations.)
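Concretely, that combination count comes from something like this (the exact zoom/ratio values here are from my memory of the nb and may differ):

```python
from itertools import product

anc_zooms  = [0.7, 1., 1.3]                    # 3 zoom levels
anc_ratios = [(1., 1.), (1., 0.5), (0.5, 1.)]  # 3 aspect ratios
k = len(list(product(anc_zooms, anc_ratios)))  # 3 * 3 = 9 anchor box types
```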
The 2nd argument passed to `nn.Conv2d` is the output channel depth:

- (C+1) * K for `oconv1`, which is responsible for classification: 20+1 class predictions for each of the 9 anchor box types: (20+1) * 9 = 189, hence depth 189 for o1c, o2c, and o3c.
- 4 * K for `oconv2`, which is responsible for localization: 4 bbox coords for each of the 9 anchor box types: 4 * 9 = 36, hence depth 36 for o1l, o2l, and o3l.
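To make the earlier "stacking along the channel dim" point concrete: at any single grid cell, the 189 classification channels can be read as 9 anchors x 21 class scores. A standalone sketch with dummy data (assuming consecutive channels group by anchor, which is how the nb's flattening step treats them):

```python
import torch

k, n_scores = 9, 21                          # 9 anchor types, 20 classes + background
scores = torch.randn(1, k * n_scores, 4, 4)  # stand-in for an oconv1 output
cell = scores[0, :, 2, 1].view(k, n_scores)  # cell (2, 1): 21 scores per anchor
print(cell.shape)                            # torch.Size([9, 21])
```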
(Note, these layers all get flattened and then concatenated into a different shape in the end, as sketched below. Also, the channel-depth logic above applies at every grid scale.)
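The flattening step looks roughly like this (my recollection of the nb's flatten_conv helper, so treat it as a sketch):

```python
def flatten_conv(x, k):
    """(bs, nf, g, g) -> (bs, g*g*k, nf//k): one row per anchor box."""
    bs, nf, gx, gy = x.size()
    x = x.permute(0, 2, 3, 1).contiguous()  # move channels to the last dim
    return x.view(bs, -1, nf // k)          # split each cell's channels into k anchors
```

After flattening, the per-scale outputs (e.g. the 4x4, 2x2, and 1x1 grids) get concatenated along dim 1, so every anchor box across all scales ends up as one row.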