Sure, I’m using:
import numpy as np

lr = 1e-3
lrs = np.array([lr/100, lr/10, lr])
# 1
learn.fit(lrs, 1, cycle_len=5, use_clr=(20,10))
# 2
learn.freeze_to(-1)
learn.fit(lrs/2, 1, cycle_len=5, use_clr=(20,10))
# 3
learn.freeze_to(-2)
learn.fit(lrs/2, 1, cycle_len=5, use_clr=(20,10))
# 4
learn.freeze_to(0)
learn.fit(lrs/4, 1, cycle_len=5, use_clr=(20,10))
# 5
learn.freeze_to(0)
learn.fit(lrs/5, 1, cycle_len=5, use_clr=(20,10))
# 6
learn.freeze_to(0)
learn.fit(lrs/5, 1, cycle_len=5, use_clr=(20,10))
Usually by the end of cycle 3 (15 epochs), my mAP gets within 1-2 points of what it would be after cycle 6, so I often stop there to make comparisons and save time/GPU.
My latest comparison using the original actn_to_bb function after 6 cycles is:
- FPN - 41.8% mAP
- noFPN - 43.0% mAP
My accidental omission of tanh(actn), clamping the outputs of actn_ctrs and actn_hw instead, did seem to give a working FPN relative to noFPN (even though both scores were below baseline). This suggests I should keep exploring how we decode bounding boxes from the activations. I also tried your encode_gt2anc and decode_preds functions with both Smooth L1 loss and regular L1 loss, but have only reached 37-39% mAP so far.
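For reference, a minimal sketch of the kind of activation-to-box decoding I'm comparing variants against (names follow pascal-multi's actn_to_bb; the [ctr_y, ctr_x, h, w]-style anchor layout and the grid_sizes tensor are assumptions about the setup, not my exact code):

```python
import torch

def hw2corners(ctr, hw):
    # convert [center, height/width] boxes to [top-left, bottom-right]
    return torch.cat([ctr - hw / 2, ctr + hw / 2], dim=1)

def actn_to_bb(actn, anchors, grid_sizes):
    # squash raw activations into (-1, 1) so offsets stay bounded
    actn_bbs = torch.tanh(actn)
    # centers: shift each anchor center by up to half a grid cell
    actn_ctrs = actn_bbs[:, :2] / 2 * grid_sizes + anchors[:, :2]
    # height/width: scale each anchor's h/w by a factor in (0.5, 1.5)
    actn_hw = (actn_bbs[:, 2:] / 2 + 1) * anchors[:, 2:]
    return hw2corners(actn_ctrs, actn_hw)
```

With zero activations this reduces to the anchors themselves in corner form, which is a handy sanity check when swapping decoding schemes.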
Settings and interesting observations from my earlier post that seem to remain helpful:
f_model=resnet34
size=224
batch_size=32
# anchors:
anc_grids = [28,14,7,4,2,1]
anc_zooms = [.7, 2**0, 2**(1/3), 2**(2/3)]
anc_ratios = [(1.,1.), (.5,1.), (1.,.5), (3.,1.), (1.,3.)]
len(anchors), k
(21000, 20)
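Those numbers follow directly from the settings above: k = zooms x ratios = 4 x 5 = 20 shapes per grid cell, and the grids contribute 28^2 + 14^2 + 7^2 + 4^2 + 2^2 + 1^2 = 1050 cells, giving 1050 x 20 = 21000 anchors. A quick check:

```python
anc_grids = [28, 14, 7, 4, 2, 1]
anc_zooms = [.7, 2**0, 2**(1/3), 2**(2/3)]
anc_ratios = [(1., 1.), (.5, 1.), (1., .5), (3., 1.), (1., 3.)]

k = len(anc_zooms) * len(anc_ratios)     # anchor shapes per grid cell
n_cells = sum(g * g for g in anc_grids)  # total cells across all grids
n_anchors = n_cells * k
print(n_anchors, k)  # -> 21000 20
```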
row-by-row anchor box creation and x.permute(0,2,3,1).contiguous() as seen in pascal-multi
# in ssd_1_loss:
pos = gt_overlap > 0.4 # having tried higher and lower, 0.4 seems the best balance
clas_loss = loss_f(b_c, gt_clas)/len(pos_idx) # normalizing loss by num of matched anchors
# in FocalLoss:
alpha,gamma = 0.25,1. # had gamma=2. in older notebooks, haven't checked how much difference this makes on mAP
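For concreteness, here is a minimal sketch of the binary focal loss those hyperparameters plug into (the standard formulation on one-hot targets; not necessarily identical to the notebook's FocalLoss class):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=1.):
    # per-element binary cross-entropy, kept unreduced so we can reweight it
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    pt = p * targets + (1 - p) * (1 - targets)         # prob of the true class
    w = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    # (1 - pt)^gamma down-weights easy, confidently-classified anchors
    return (w * (1 - pt) ** gamma * bce).sum()
```

With gamma=1 the easy-example down-weighting is gentler than the paper's gamma=2, which may matter given how many of the 21k anchors are easy negatives.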
# in SSD custom head, forward pass. Adding ReLU after grabbing 28x28 and 14x14 feature maps from resnet34:
c1 = F.relu(self.sfs[0].features) # 28
c2 = F.relu(self.sfs[1].features) # 14
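(self.sfs here refers to fastai 0.7's SaveFeatures forward hooks. A self-contained sketch of the same idea in plain PyTorch, with a toy two-conv "backbone" standing in for resnet34's intermediate layers — module sizes and names are illustrative only:)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaveFeatures:
    # minimal stand-in for fastai 0.7's SaveFeatures hook
    def __init__(self, module):
        self.hook = module.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, inp, out):
        self.features = out
    def remove(self):
        self.hook.remove()

# toy backbone: two stride-2 convs standing in for resnet layers
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1),
    nn.Conv2d(8, 16, 3, stride=2, padding=1),
)
sfs = [SaveFeatures(backbone[0]), SaveFeatures(backbone[1])]

x = torch.randn(1, 3, 56, 56)
_ = backbone(x)
c1 = F.relu(sfs[0].features)  # finer map, analogous to the 28x28 grab
c2 = F.relu(sfs[1].features)  # coarser map, analogous to the 14x14 grab
```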
Things I tried that didn’t help or actually hurt performance:
- bilinear vs nearest interpolation for upsampling feature maps didn't make much difference. Bilinear is easier to work with because it can do 4x4 -> 7x7 without throwing an error.
- using F.upsample and a separate "upsamp_add" function in the forward loop dropped performance 1-2 points versus defining nn.Upsample layers in __init__ and then manually adding the upsampled outputs to the lateral layer outputs in the forward pass (i.e. p5 = self.upsamp2(p6) + self.lat5(c5)). No idea yet why this makes a difference.
- similarly, defining one outconv layer (self.out = OutConv(k, 256, bias)) and using a for loop to build a list of outputs that we then concat hurt performance by 3 points vs defining separate outconvs for each layer and concat-ing those (i.e. [torch.cat([o1c,o2c,o3c,o4c,o5c,o6c], dim=1), torch.cat([o1l,o2l,o3l,o4l,o5l,o6l], dim=1)]).
- using smoothing layers after upsamp+add appears to create the correct feature maps (see images below) but including them slows down training by an order of magnitude (1e-4) and I can’t seem to train to the same mAP.
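To make the layer-definition point concrete, here is a minimal sketch of the variant that worked best for me: nn.Upsample defined in __init__, a 1x1 lateral conv, and the add done manually in forward, mirroring p5 = self.upsamp2(p6) + self.lat5(c5). Channel counts and names are placeholders, not my exact head:

```python
import torch
import torch.nn as nn

class FPNStep(nn.Module):
    # one top-down FPN step: upsample the coarser pyramid map,
    # add the 1x1-conv lateral from the finer backbone map
    def __init__(self, c_lateral, c_out=256):
        super().__init__()
        self.upsamp = nn.Upsample(scale_factor=2, mode='nearest')
        self.lat = nn.Conv2d(c_lateral, c_out, kernel_size=1)
    def forward(self, p_coarse, c_fine):
        return self.upsamp(p_coarse) + self.lat(c_fine)

step = FPNStep(c_lateral=512, c_out=256)
p6 = torch.randn(1, 256, 7, 7)    # coarser pyramid level
c5 = torch.randn(1, 512, 14, 14)  # finer backbone feature map
p5 = step(p6, c5)
```

The mystery is why wrapping the same ops in F.upsample plus a helper function behaves worse; the module version above at least keeps the upsampling layers registered and easy to inspect.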
Lateral layer outputs (no upsampling or adding, just the 1x1 conv):
Smoothing layer outputs (lateral + upsamp from 7x7 and then conv 3x3 with stride 1 pad 1):