RetinaNet

Hey, thanks for pointing out the actn_to_bb change - that was actually a holdover from some experiments I was doing to try different ways of generating the prediction bboxes from activations. I forgot to revert it so I’m re-running the comparisons now on the original actn_to_bb function.

Looks like my FPN implementation isn’t beating the baseline yet, sigh… (39.2% mAP with FPN vs 40.1% without, after 15 epochs on both).

I’m temporarily taking down my prior post to update with the new and correct comparison numbers…sorry about the premature hope :frowning:

Yeah, I’m still having trouble with it too :frowning:
Could you share again the training schedule you use? You mentioned it in the post you took down and I forgot, but I’d like to use the same so we can benchmark our different models.

Sure, I’m using:

lr = 1e-3
lrs = np.array([lr/100, lr/10, lr])  # differential learning rates for the three layer groups

# cycle 1
learn.fit(lrs, 1, cycle_len=5, use_clr=(20,10))
# cycle 2
learn.freeze_to(-1)
learn.fit(lrs/2, 1, cycle_len=5, use_clr=(20,10))
# cycle 3
learn.freeze_to(-2)
learn.fit(lrs/2, 1, cycle_len=5, use_clr=(20,10))
# cycle 4
learn.freeze_to(0)
learn.fit(lrs/4, 1, cycle_len=5, use_clr=(20,10))
# cycle 5
learn.freeze_to(0)
learn.fit(lrs/5, 1, cycle_len=5, use_clr=(20,10))
# cycle 6
learn.freeze_to(0)
learn.fit(lrs/5, 1, cycle_len=5, use_clr=(20,10))

Usually by the end of cycle 3 (15 epochs), my mAP gets to within 1-2 points of what it would be after cycle 6, so I often stop there to make comparisons and save time/GPU.

My latest comparison using the original actn_to_bb function after 6 cycles is:

  • FPN - 41.8% mAP
  • noFPN - 43.0% mAP

My accidental change (skipping tanh(actn) and instead clamping the outputs of actn_ctrs and actn_hw) did seem to lead to an FPN that beat noFPN, even if both overall scores were lower than the baseline. This suggests I should continue exploring changes to how we generate bounding boxes from the activations. I tried using your encode_gt2anc and decode_preds functions with both Smooth L1 Loss and regular L1 Loss but could only get to 37-39% mAP so far.
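
For context, here is a minimal sketch of the two activation-to-bbox variants being compared. The tanh version follows pascal-multi’s actn_to_bb; the clamp variant is only my rough guess at the accidental change described above, and its exact scaling is an assumption:

import torch

def hw2corners(ctr, hw):
    # convert (center, height/width) boxes to (top-left, bottom-right) corners
    return torch.cat([ctr - hw/2, ctr + hw/2], dim=1)

def actn_to_bb_tanh(actn, anchors, grid_sizes):
    # pascal-multi style: squash activations with tanh, then offset/scale relative to each anchor
    actn_bbs = torch.tanh(actn)
    actn_ctrs = (actn_bbs[:, :2]/2 * grid_sizes) + anchors[:, :2]
    actn_hw = (actn_bbs[:, 2:]/2 + 1) * anchors[:, 2:]
    return hw2corners(actn_ctrs, actn_hw)

def actn_to_bb_clamp(actn, anchors, grid_sizes):
    # clamp variant (assumption): no tanh, just clamp centers and sizes into a sane range
    actn_ctrs = (actn[:, :2] * grid_sizes + anchors[:, :2]).clamp(0, 1)
    actn_hw = ((actn[:, 2:] + 1) * anchors[:, 2:]).clamp(min=1e-4)
    return hw2corners(actn_ctrs, actn_hw)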

Settings and interesting observations from my earlier post that seem to remain helpful:

f_model=resnet34
size=224
batch_size=32

# anchors:
anc_grids = [28,14,7,4,2,1]
anc_zooms =  [.7, 2**0, 2**(1/3), 2**(2/3)]
anc_ratios = [(1.,1.), (.5,1.), (1.,.5), (3.,1.), (1.,3.)]
len(anchors), k
  (21000, 20)
row-by-row anchor box creation and x.permute(0,2,3,1).contiguous() as seen in pascal-multi

# in ssd_1_loss:
pos = gt_overlap > 0.4 # having tried higher and lower, 0.4 seems the best balance
clas_loss  = loss_f(b_c, gt_clas)/len(pos_idx) # normalizing loss by num of matched anchors

# in FocalLoss (a rough sketch of the loss follows these settings):
alpha,gamma = 0.25,1. # had gamma=2. in older notebooks, haven't checked how much difference this makes on mAP

# in SSD custom head, forward pass. Adding ReLU after grabbing 28x28 and 14x14 feature maps from resnet34:
c1 = F.relu(self.sfs[0].features) # 28
c2 = F.relu(self.sfs[1].features) # 14
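
As a sanity check on the (21000, 20) above: k is just the number of anchor shapes per grid cell (zooms x ratios), and the total anchor count is the sum of the squared grid sizes times k.

anc_grids = [28, 14, 7, 4, 2, 1]
anc_zooms = [.7, 2**0, 2**(1/3), 2**(2/3)]
anc_ratios = [(1., 1.), (.5, 1.), (1., .5), (3., 1.), (1., 3.)]
k = len(anc_zooms) * len(anc_ratios)          # 4 * 5 = 20 anchor shapes per cell
n_anchors = sum(g*g for g in anc_grids) * k   # (784+196+49+16+4+1) * 20 = 21000

And here is a minimal, self-contained sketch of the focal loss weighting with alpha=0.25, gamma=1. as listed above. It follows the pascal-multi-style per-class sigmoid (binary) formulation; the exact class in my notebook may differ slightly:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=1.):
    # logits/targets: same shape, targets one-hot per class (background = all zeros)
    p = torch.sigmoid(logits)
    pt = p*targets + (1 - p)*(1 - targets)           # probability assigned to the true label
    w = alpha*targets + (1 - alpha)*(1 - targets)    # class-balance term
    w = (w * (1 - pt).pow(gamma)).detach()           # down-weight easy examples; treat weight as constant
    return F.binary_cross_entropy_with_logits(logits, targets, w, reduction='sum')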

Things I tried that didn’t help or actually hurt performance:

  • bilinear vs nearest interpolation for upsampling feature maps didn’t make much difference. Bilinear is easier to work with because it can do 4x4 -> 7x7 without throwing an error.
  • using F.upsample and a separate “upsamp_add” function inside the forward loop dropped performance by 1-2 points versus defining nn.Upsample layers in __init__ and then manually adding the upsampled outputs to the lateral layer outputs in the forward pass (i.e. p5 = self.upsamp2(p6) + self.lat5(c5)); a sketch of this pattern follows this list. No idea yet why it makes a difference.
  • similarly, defining one outconv layer (self.out = OutConv(k, 256, bias)) and using a for loop to create a list of outputs that we then concat hurt performance by 3 points vs defining separate outconvs for each layer and concat-ing those (i.e. [torch.cat([o1c,o2c,o3c,o4c,o5c,o6c], dim=1), torch.cat([o1l,o2l,o3l,o4l,o5l,o6l], dim=1)]).
  • using smoothing layers after upsamp+add appears to create the correct feature maps (see images below), but including them slows down training by an order of magnitude (1e-4) and I can’t seem to train to the same mAP.
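
For reference, here is a minimal sketch of the nn.Upsample-in-__init__ plus lateral-add pattern from the second bullet. The layer names, the channel counts (resnet34’s 28x28/14x14/7x7 maps), and the simplification to three pyramid levels are my assumptions, not the exact head:

import torch.nn as nn

class FPNTopDownSketch(nn.Module):
    # Top-down FPN path over three pyramid levels: 7x7 -> 14x14 -> 28x28
    def __init__(self, c3=128, c4=256, c5=512):
        super().__init__()
        self.lat5 = nn.Conv2d(c5, 256, 1)   # 1x1 lateral convs onto a common 256-channel space
        self.lat4 = nn.Conv2d(c4, 256, 1)
        self.lat3 = nn.Conv2d(c3, 256, 1)
        # upsampling layers defined here in __init__ rather than calling F.upsample in forward
        self.upsamp5 = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.upsamp4 = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, c3, c4, c5):
        p5 = self.lat5(c5)                     # 7x7
        p4 = self.upsamp5(p5) + self.lat4(c4)  # 14x14 = upsampled 7x7 + lateral 14x14
        p3 = self.upsamp4(p4) + self.lat3(c3)  # 28x28 = upsampled 14x14 + lateral 28x28
        return p3, p4, p5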

Lateral layer outputs (no upsampling or adding, just the 1x1 conv):

Smoothing layer outputs (lateral + upsamp from 7x7 and then conv 3x3 with stride 1 pad 1):


I am getting this error with lr_find(). Does anyone have an idea where to look for the problem, or what might be causing it? If I ignore it and keep going, training seems to continue fine. Thanks

I’ve never seen that one. There’s something wrong with the model state at the end of the lr_find when it tries to load the model it saved at the beginning. Really weird!
Do you have more code? Did you try to remove the tmp file in your models directory? It might be corrupt for some reason?


Thanks for the reply. I will dig deeper, but at the moment it doesn’t seem to cause a problem for other functions like fit, though I’m not sure whether it causes an issue that just isn’t obvious yet. Another question I have is that all my probabilities from RetinaNet are coming out really low.

Here is a sample prediction for batch and corresponding probability distributions for all 21 classes after sigmoid activation:

As you can see, the max is around 0.05.

Given 20 classes (from pascal):

My network outputs HxWxAx21, where A is the number of anchors per grid cell. Do you have any intuition about a possible reason for such a problem?

I am using the default Focal Loss from your notebook and the other helper functions from the pascal-multi notebook; the only difference is the way I constructed the model class. But I believe the forward pass works, since I tested it with a dummy input and got the two expected outputs: one from the classification subnet (HxWx21xA) and one from the regression subnet (HxWx4xA).
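
For anyone checking shapes: here is a minimal sketch of a pascal-multi-style reshape that turns a subnet’s (bs, A*n_out, H, W) conv output into (bs, H*W*A, n_out) rows, one row per anchor box, which is the layout the loss and mAP code expect. The name and exact signature here are my own, not necessarily the helper used above:

def flatten_conv(x, a):
    # x: (bs, a*n_out, H, W) -> (bs, H*W*a, n_out)
    bs, nf, gx, gy = x.size()
    x = x.permute(0, 2, 3, 1).contiguous()   # channels last, matching pascal-multi
    return x.view(bs, -1, nf // a)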

Here is a sample prediction for a validation image. I can’t test mAP yet since the class confidences are very low.

Note: I am only using 1 anchor per grid cell so far.

Appreciate your help!

Thanks

Do you use the bias initialization at -4 (or close to that)? If so, it makes it harder for the network to predict a class because it’s wired to predict background at the start, and it takes a lot of training before it gives more confident predictions.
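
For reference, the -4 comes from the RetinaNet prior initialization with pi = 0.01; a quick check:

import numpy as np

pi = 0.01                          # prior probability of an anchor being foreground
bias_init = -np.log((1 - pi)/pi)   # = -log(99) ~ -4.6, so sigmoid(bias) ~ 0.01 at the start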


This is my classification subnet; I initialized the last layer bias with -np.log((1 - pi)/pi):

import numpy as np
import torch.nn as nn

class Subnet1(nn.Module):
    """For classification: outputs K*A channels per grid cell"""
    def __init__(self, K, A, in_c, use_bn=False, depth=4, pi=0.01):
        super().__init__()

        # Number of anchors per grid cell
        self.A = A

        # depth (=4) blocks of convolutions; conv_bn_relu() is a separate helper returning a
        # conv+BN+ReLU block, and each repeat gets a fresh block so weights aren't shared
        self.conv = nn.Sequential(*[conv_bn_relu() for _ in range(depth)])

        # Final convolution for prediction
        self.out_conv = nn.Conv2d(in_c, K*A, kernel_size=3, stride=1, padding=1)

        # N(0, 0.01) weight init, bias = -np.log((1-pi)/pi) prior init
        self.out_conv.weight.data.normal_(0, 0.01)
        self.out_conv.bias.data.fill_(-np.log((1 - pi)/pi))

    def forward(self, x):
        return flatten_conv(self.out_conv(self.conv(x)), self.A)

Without the bias initialization to ~ -4, the loss starts off around 6e3, so you are right about that, but I was already using the bias initialization. There must be something else wrong in my case.


It will cause a nasty problem - since it’s not reloading the original model, after lr_find you’ll have a terrible set of weights with lots of zero gradients. So after this happens, you need to recreate or re-initialize your learner.


@sgugger Thanks for sharing your notebook. I am also trying to recreate the retina net. I see two possible problems:

  1. There is no sigmoid layer at the end of the classification subnet. Is that intended? I believe the paper says they added a sigmoid at the end. If you change this, the loss function may also need to be changed, I believe.
  2. ReLU in between the convolutions of the classification and regression subnets? I think that could be needed, but I’m not very sure since the paper doesn’t specifically mention it.

I have yet to check with these changes myself. Will update once I am somewhere close.
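
On point 1, one common arrangement (not necessarily what the notebook in question does) is to keep the classification subnet output as raw logits and fold the sigmoid into a with-logits loss, so no explicit sigmoid layer is needed in the model; sigmoid is then only applied at inference time. A tiny sketch with dummy shapes:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 189, 20)    # dummy raw subnet output: (bs, n_anchors, n_classes), no sigmoid layer
targets = torch.zeros(4, 189, 20)   # dummy one-hot targets (all background here)

loss = F.binary_cross_entropy_with_logits(logits, targets)   # sigmoid is applied inside the loss
probs = torch.sigmoid(logits)                                # only apply sigmoid for eval / mAP / NMS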

@sgugger @daveluo
Hi, I am also digging into the RetinaNet implementation.
Is there any update on an implementation of RetinaNet with any backbone on fastai v1.0, or a proper implementation (with mAP comparable to the Keras RetinaNet implementation) for older versions of fastai (such as 0.7)?


Great discussion. I surfed through it 2 days ago and it gave me great intuition. I’m curious whether anyone got RetinaNet to work. I know that @sgugger has a recent notebook on it, but I haven’t had much time to check it out yet.

It’s interesting that your implementation, as well as Jeremy’s implementation in lesson #9, uses dropout extensively. This is in contrast with the original SSD and RetinaNet implementations, where no dropout is used.

I spoke to a CV researcher the other day and he mentioned that it’s very rare for dropout to be used in object detection. I’m not sure how to square this, as Jeremy’s model yields a respectable 30% mAP on VOC07 using train only, not trainval.

@sgugger
Is there a latest updated notebook on RetinaNet?
