SSD Object Detection overfits very quickly and ends up detecting `person`, overshadowing all other classes

I tried to recreate the SSD object detection work on the Pascal dataset, using @rohitgeo’s and @joseadolfo’s excellent notebooks as a reference. Thank you to both. These were immensely useful in understanding the concepts.


However, when I train my models, the loss plateaus at a specific value. I’ve tried multiple training approaches:

  1. Train the last layers.
  2. Unfreeze the last 2 layers, then unfreeze everything.

Also, I tried the approach where I started with training the unfrozen last 2 layers.

When I follow the approach in @joseadolfo’s notebook, I end up overfitting, with the model predicting only person. My understanding is that there are a lot more samples of person. I also notice that the sample notebook shows a steady fall in training and validation loss, whereas I get a steep drop early on followed by a plateau. I trained for 120 epochs before I realized that the overfitting could have affected the weights significantly.

One difference in my approach was choosing not to divide by 224 as in @rohitgeo’s notebook, which worked well when I attempted SingleObjectDetection using 4x4 grids alone.

I’ve tried discriminative learning rates, but that ends up with similar overfitting.

Here is the notebook I’ve been working on.

I’m out of ideas on what I should be trying to get different results. Would really like some feedback and suggestions.

One thing I do notice is that some of the predictions include the right categories, but they get masked when I run class_scores.sigmoid > threshold. Lowering the threshold also adds a lot of other noise.

Hi, Siddharth,

Under the initial assumption that your model and loss function are correct, what is striking in your notebook is the very low learning rates you use. When you first freeze and use lr_find, you select lr=2.0e-3. That is OK and it works, but I would have chosen a more aggressive lr=1.0e-2 and run it for about 10 epochs. In subsequent runs, however, lr_find’s suggested point is meaningless. It makes you choose very small values for lr, and that is why the training cycle is stuck. To select a meaningful value for lr, go to the point in the graph where the loss starts to turn up, then divide that value by 10 or 20. You should use a small value for lr only when you unfreeze. Nevertheless, even with very small values for lr, you are overfitting.

My second suggestion is to verify the values of dropout and bias in the SSD head. The model is very sensitive to these values.

A third suggestion is to monitor the regression_loss and clas_loss in the ssd_loss module. Print the values separately and verify that they are balanced. You use aggregate values for the losses; I prefer to use average values.

A final suggestion is to use softmax instead of sigmoid when checking the threshold. When you have multiple categories, softmax discriminates more strongly in favor of the predictions with higher probabilities. Hope this helps.
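To make the softmax suggestion concrete, here is a minimal sketch in plain Python comparing the two activations under the same threshold. The logits, the three class names, and the 0.5 threshold are made up for illustration and do not come from either notebook; in PyTorch the equivalent calls would be torch.sigmoid and torch.softmax.

```python
import math

def sigmoid(scores):
    """Independent per-class probabilities; they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

def softmax(scores):
    """Probabilities that compete across classes and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classes_above(probs, threshold):
    """Indices of the classes whose probability exceeds the threshold."""
    return [i for i, p in enumerate(probs) if p > threshold]

# Hypothetical raw class scores for one anchor box: [person, dog, chair]
logits = [2.0, 1.5, -1.0]
threshold = 0.5  # assumed value, for illustration only

sig_hits = classes_above(sigmoid(logits), threshold)
soft_hits = classes_above(softmax(logits), threshold)

print("sigmoid keeps classes:", sig_hits)
print("softmax keeps classes:", soft_hits)
```

With these made-up scores, sigmoid lets both of the first two classes through the threshold, while softmax keeps only the highest-scoring one, because the classes compete for a total probability of 1.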


Thank you, @joseadolfo. I’ll try these suggestions and post an update.

Posting an update on my progress since last week.

Finding the right learning rates has begun to appear like the search for the dragon scroll of limitless power.

  1. This is the closest I’ve got to meaningful results, but it felt like a fluke: I started by training all the layers, since training only the last layer or the last 2 led to everything being detected as a person, and it became hard to recover from that model. It’s still far from good.
  2. This approach used higher dropout with bias enabled. What I don’t understand is what bias=True in Conv2d does vs something like this
    a. My understanding of the bias is that it’s the additive parameter b in a*x + b, and that we omit it in some cases. Why? When should we use it?
  3. Another mistake in my original post 6 days ago was in the loss function, which I have since fixed:
def ssd_loss(pred, targ, target_class, debug=False):
    pred_classes, pred_coords = pred
    # For each set of 16x4 coords and 16x(num_classes) per image in a batch compute the loss
    regression_loss, class_loss = 0., 0.
    for p_cls, p_coords, t_coords, t_cls in zip(pred_classes, pred_coords, targ, target_class):
        l1_loss, cls_loss = ssd_1_loss(p_cls, p_coords, t_cls, t_coords)
        # Accumulate both parts of the loss over every image in the batch
        regression_loss += l1_loss
        class_loss += cls_loss
    if debug:
        print(f'regression_loss: {regression_loss}, class_loss: {class_loss}')
    return regression_loss + class_loss

I was incorrectly returning regression_loss + cls_loss, where cls_loss is the class loss of a single image with its 189 activations and class predictions.
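The effect of that kind of bug can be sketched with plain numbers. The earlier, buggy version of the function is not shown in the thread, so the shape of batch_loss_buggy below is an assumption, and per_image_loss is a hypothetical stand-in for ssd_1_loss that simply returns precomputed values.

```python
def per_image_loss(image):
    """Hypothetical stand-in for ssd_1_loss: returns (l1_loss, cls_loss)."""
    return image["l1"], image["cls"]

def batch_loss_buggy(batch):
    """Assumed shape of the bug: the class loss is never accumulated."""
    regression_loss = 0.0
    for image in batch:
        l1_loss, cls_loss = per_image_loss(image)
        regression_loss += l1_loss
    # cls_loss here holds only the LAST image's class loss
    return regression_loss + cls_loss

def batch_loss_fixed(batch):
    """Accumulate both parts and return them separately for monitoring."""
    regression_loss, class_loss = 0.0, 0.0
    for image in batch:
        l1_loss, cls_loss = per_image_loss(image)
        regression_loss += l1_loss
        class_loss += cls_loss
    return regression_loss, class_loss

# Made-up per-image losses for a batch of three images
batch = [
    {"l1": 1.0, "cls": 4.0},
    {"l1": 2.0, "cls": 5.0},
    {"l1": 3.0, "cls": 6.0},
]

buggy_total = batch_loss_buggy(batch)
reg, cls = batch_loss_fixed(batch)
fixed_total = reg + cls

# Per-image averages make the two components easier to compare
avg_reg, avg_cls = reg / len(batch), cls / len(batch)
print(f"buggy total: {buggy_total}, fixed total: {fixed_total}")
print(f"avg regression: {avg_reg}, avg class: {avg_cls}")
```

In the buggy variant the class term contributes only one image's worth of loss, so the regression term dominates more as the batch grows. Returning the two components separately also makes it easy to print them and check that they are balanced.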

Other things I tried were:

  1. Using reduction='mean' instead of reduction='sum'. This helps when the lr is not optimal and keeps the erratic shifts smaller.
  2. Choosing the learning rates as @joseadolfo suggested.
  3. Using softmax instead of sigmoid when applying the threshold.
  4. Trying resnet50 with a slightly modified model.
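One way to see why reduction='mean' keeps the shifts smaller than reduction='sum' is to look at the gradient with respect to a parameter shared by every prediction. The sketch below is plain Python rather than PyTorch, and it assumes 189 anchors with 4 coordinates each (the 4 is an assumption; the 189 comes from the post above) and a single shared shift added to all predictions.

```python
def l1_loss(preds, targets, reduction="mean"):
    """L1 loss over flat lists, mirroring the 'sum' / 'mean' choice."""
    errors = [abs(p - t) for p, t in zip(preds, targets)]
    total = sum(errors)
    return total / len(errors) if reduction == "mean" else total

def l1_grad_wrt_shift(preds, targets, reduction="mean"):
    """Gradient of the L1 loss w.r.t. one shift added to every prediction."""
    # d|p - t|/dp is the sign of (p - t); a shared shift sums those signs
    signs = [(p > t) - (p < t) for p, t in zip(preds, targets)]
    total = sum(signs)
    return total / len(signs) if reduction == "mean" else total

n_values = 189 * 4            # assumed: 189 anchors x 4 box coordinates
preds = [0.6] * n_values      # made-up predictions, all slightly too high
targets = [0.5] * n_values

grad_sum = l1_grad_wrt_shift(preds, targets, reduction="sum")
grad_mean = l1_grad_wrt_shift(preds, targets, reduction="mean")

lr = 1e-2  # same learning rate for both, for illustration
print("step with sum: ", lr * grad_sum)
print("step with mean:", lr * grad_mean)
```

Under these assumptions the 'sum' gradient is larger than the 'mean' gradient by a factor equal to the number of values, so the same lr produces a much bigger update. That is consistent with 'sum' being less forgiving of a poorly chosen learning rate.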

I’m not sure what I’ve missed, but I would love to hear feedback and suggestions on the approaches I’ve tried. In case you have time, here are the various attempts I made at solving SSD with larger anchor grids.