[Kaggle] Discussion on Global Wheat Detection Challenge

This thread is for any discussion about the latest Kaggle competition – Global Wheat Detection Challenge.

It seems there are a number of fellows also playing the competition. I think it’s a good idea to have a thread dedicated to this competition, where we can share ideas/ tips/ useful resources, recruiting teammates, discussing how we could apply fastai2/ fastai on the competition, or troubleshoot bugs (whatever about the competition!).

On my side, I originally attempted to interface torchvision’s FasterRCNN to fastai2, but ended with a so-so performance and the interface is not working pretty well (many ugly tweaks involved). So right now I am trying to use fastai’s RetinaNet instead in the competition. For RetinaNet, I mainly take reference from this notebook from course v3. Though the notebook implements RetinaNet in fastai v1, I think it is a pretty good starting point.

I have spent quite some time understanding how RetinaNet works from this notebook (e.g. the format of bbox in model input and model output, post-processing steps involved on predicted bounding boxes in inference / training mode … etc).

My feeling so far on object detection models (such as FasterRCNN, RetinaNet) is that it involves A LOT OF DETAILS. Missing one of them could probably lead to a poor performance. I would like to share something you had better pay attention below:


  1. Pay extra attention on the bbox format of your data input v.s. the bbox format your model expect v.s. the bbox format the loss function expect! Taking RetinaNet from the above notebook as an example, its loss function expect the target bbox to have a rescaled TLBR format (i.e. [y0, x0, y1, x1] AND each numbers are rescaled to [-1, +1])! If the bbox of your data input is in COCO format (i.e. [x0, y0, w, h] and each numbers are not scaled) / PASCAL-VOC format (i.e. [x0, y0, x1, y1] and each numbers are not scaled), make sure you have transform them to the right format before feeding into loss function. Failing to do so will make your training loss curve hardly drop. (I made that mistake before)
  2. Check if the model count background into the number of class. Take RetinaNet as an example, background is excluded from the number of class. But for other model such as FasterRCNN / YOLO, it count background into the number of class (correct me if I am wrong).
  3. The mAP metrics the competition is using is DIFFERENT from the standard definition! See the resources below for the implementation of metrics, as a reference.

Resources I Found Useful

  1. [fastai] course v3 RetinaNet implementation
  2. [fastai] object detection using fastai2
  3. [kaggle kernel] an implementation of mAP metrics for this competition
  4. [kaggle kernel] training a keras RetinaNet

Do feel free to share your tips / resources, I am happy to consolidate them all here!


right now I have made an end-to-end pipeline for RetinaNet from training to kernel submission. From what I inspect on the test set performance, the prediction makes sense but the model is not doing good. Here are some snapshots:
1 2 3 4

I am looking for a directional improvement on that. It would also be great if any fellows could help point out potential issues wrt the model performance.


I, just like you, also tried to interface torchvision with fastai, and also failed…

I’m actually building a object detection library and have a starter kernel for fasterrcnn here =)

It’s a bit outdated, but I’ll push a new version soon.

The results are also not so great at this point, there is a lot of objects being detect multiple times, I still have to play around with the hyperparameters


great to see you are also working on this competition!
Your kernel looks great. Are you using original image size (1024 x 1024) for training?
I will stick with RetinaNet for a while, gonna try out a few hyperparameters (e.g. bias, alpha, gamma, detection threshold, and image resize)
Right now I am using 256 x 256 for training. As a result, output size of some feature maps from FPN is pretty small (e.g. 2x2, 4x4). With such a huge receptive field / grid size, I suspect it causes many wheats being missed in deeper FPN’s feature maps.

Do let me know if you find a directional improvement. :slightly_smiling_face:


I’m still using 1024 yes,

Sure! Please keep us posted as well =)

May I ask what hardware you use?

I train RetinaNet on server with a 1080 Ti. (11 GB)
I use 32 batch size, image resized to 256 x 256. The training used up ~10GB of GPU memory.
If you reduced the batch size to 16, you can train with <6 GB GPU memory (as far as I can remember)


A few update, this week I am mainly optimizing hyper-parameters. The hyper-parameters I consider are alpha, gamma, NMS threshold and bias (of last layer). Using validation set’s mAP as evaluation, I found the optimal hyper-parameters are quite different from the default provided:

  1. bias: -4 --> -2
  2. alpha: 0.25 --> 0.5
  3. gamma: 2 --> 1
  4. no change on NMS threshold (i.e. 0.3)

Different hyper-parameter setting could give wildly different mAP performance. From my experiments, it could account for a difference of 10%+ validation set’s mAP.

But unfortunately, such boost didn’t fully translate to public LB. Using the selected optimal hyper-parameters only improves ~4% mAP in public LB. (i.e. from 0.22 to 0.26) The performance is still poor. It does particularly bad on region with dense objects. Here is the prediction on the same test set before:

1 2 3 4

I needa keep finding ways to improve the imperformance. If it still fails, probably I gota change model. Any new insights are welcome.


Hey @riven314, thanks for keeping us posted.

The results I’m getting are also very similar to yours. I actually suspected that increasing NMS did not improved the results, sometimes the boxes get really crowded.

I see people on the leaderboard scoring ~70, why is that so different from what we’re getting ~24?

I have exactly the same question! A torchvision’s FasterRCNN baseline could easily get 60+ mAP in public LB, while we struggle to get a 20+ in public LB. My suspicion is that it must have something fundamentally wrong about our model configuration. My intuition is that “THAT THING” is something we gota look into even before looking into hyperparameters.

I guess “THAT THING” is the pretrained backbone we are using. For torchvision.FasterRCNN, it’s using a backbone pretrained on COCO.(i.e. a FasterRCNN pretrained on COCO with its predictor head cut and its FPN remained, by default the FPN is based on resnet50) For fastai2, by default it’s using ImageNet (i.e. a resnet34/ resnet50 pretrained on ImageNet with the deeper layers cut). And COCO is exactly an object detection task while ImageNet is about image classification.

In addition, the COCO pretrained model (Using ImageNet as pretrained to train COCO, and then using COCO as the pretrained) is taking one step ahead of our ImageNet pretrained model. (Using ImageNet as pretrained).

For those reasons, using COCO as pretrained may account for such a huge performance gap. (or sure if it makes sense to you)

Right now, I am trying to replace the pretrained model to be COCO’s one. Essentially i wanna take the FPN pretrained on COCO from torchvision.FasterRCNN, and then integrate that into my RetinaNet. But it seems to be not that straight-forward when I compare torchvision.FasterRCNN source code against fastai’s RetinaNet source code. Probably prebuilt fastai helper functions could not help me do the job, I needa refactor RetinaNet code to make it possible.


I’m using torchvision’s FasterRCNN, heh :laughing:

Maybe the metrics are calculated a bit differently? Have you tried submitting your results already?

Oh, then it’s so weird!
The mAP metrics is different from standard mAP, but I am already using the one with its definition.
Yes, I have submitted several times. When I integrate torchvision.FasterRCNN in fastai2, I got ~60% mAP in public LB. (it is still lower than baseline by 6%) But when I use RetinaNet, I miserably got ~20% mAP in public LB.


Now, the best I could get with RetinaNet is ~36% mAP in public LB.

I made the following changes to get the score:

  1. Using ResNet50 pretrained on COCO, instead of the one pretrained on ImageNet: it helps model converge faster and improves mAP by a small margin
  2. Using different anchor scales: previously I used [1., 0.6, 0.3], now I used [1., 2**(1/3), 2**(2/3)] (it’s the default from the paper). This change helps improve much of my mAP (by ~8% mAP)

However, it’s still very distant from a tolerable performance. I would spend a few more days trying to improve the performance of RetinaNet. If it still fails to achieve 60%+ in public LB, I would change my approach.

Another issue I face is that the mAP I get from my personal validation set is way higher than that I get in public LB (~8% mAP difference). Not sure if it has something wrong with my mAP metrics function, or it is a sign of overfitting.

Found another kernel that trained a keras RetinaNet for the competition.

From the prediction snapshots it offers, it seems to have a nice performance even in one epoch. But I notice it shares some issues that I also encountered in my own RetinaNet (e.g. poor performance in highly dense region). But the kernel give me some confidence that RetinaNet could do a reasonable job on this competition, the performance difference probably comes from the implementation details

A bit of my update, I kind of let go RetinaNet and took another model.
Right now, I am using EfficientDet in PyTorch (based on a popular kernel notebook)
It gives a pretty decent performance – ~0.70 public LB for 512x512 image, and ~0.58 public LB for 256x256, it seems image size has a huge impact on the performance.
I think I will integrate fastai2’s training loop in my codebase later


@lgvaz I see you have implemented a custom parser for the wheat dataset in icevision, and I am hoping you can provide a link to a jupyter notebook showing its use for this challenge?

Update: discovered https://www.kaggle.com/quazar/fasterrcnn-mantisshrimp-starter where it appears mantisshrimp is a fork of icevision

1 Like