Detecting Coconut Trees from the Air with (notebooks & dataset available)

(Dave Luo) #1

Hi all,

I’ve been working on adapting lessons 8-9 (primarily pascal-multi) to work with a new dataset and object detection challenge, namely detecting food-producing trees (like coconuts) from an aerial view (drones in this case; would work with satellite imagery too).

I wanted to:

  • change the image set and classes from PASCAL or COCO to something appreciably different (top-down detection of coconut trees for now, other tree types and building/road segmentation later),
  • adapt raw aerial imagery data and annotations of a geospatial nature to the correct input formats for object detection,
  • practice pre-processing generally messier data than the usual Kaggle challenge or academic dataset,
  • and, most importantly, achieve great performance with fastai!

Here’s the performance punchline upfront:
Here are some predictions (column 1) from a model using nearly default pascal-multi settings, compared with the ground truth bboxes (column 2) and the plain image (column 3):

Note that in the last 2 examples, the model correctly detected coconut trees that the human annotators had failed to label in the supposed “ground truth”!

Also note that my bounding box labels are synthetic: I auto-created them as 90x90 squares (or rectangles at the borders if a coord is <0 or >224), using the human-annotated point coordinate of each tree as the bbox center. This works well enough since most trees are roughly the same size, but occasionally they are bigger or smaller than 90x90. In some examples, the model seems to do a better job of finding the “real” bbox of each tree than my synthetic bboxes do. The loss function doesn’t reward that, though, so perhaps these are less “correct” predictions (w.r.t. the loss function) that happen to match up with the real-world tree sizes. Or maybe they are well-trained predictions: the vast majority of trees do size up to be ~90x90, so the occasional larger or smaller tree isn’t enough of a penalty to throw off the overall training objective.
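For anyone wanting to reproduce the synthetic labels, here is a minimal sketch of the point-to-bbox step described above. The function name, the (x_min, y_min, x_max, y_max) coordinate order, and the default arguments are assumptions for illustration; the notebook may order coordinates differently.

```python
def point_to_bbox(x, y, box_size=90, chip_size=224):
    """Build a square box centered on a tree point, clipped to the chip.

    Returns (x_min, y_min, x_max, y_max). Boxes near a border come out as
    rectangles because the part falling outside the chip is cut off.
    """
    half = box_size // 2
    x_min = max(0, x - half)
    y_min = max(0, y - half)
    x_max = min(chip_size, x + half)
    y_max = min(chip_size, y + half)
    return (x_min, y_min, x_max, y_max)


print(point_to_bbox(112, 112))  # (67, 67, 157, 157): full 90x90 square
print(point_to_bbox(10, 200))   # (0, 155, 55, 224): clipped at two borders
```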

I’ve put my latest model notebook, ready-to-train dataset, and preprocessing workflow docs in a github repo. More details and documentation available there for anyone who’d like to take this model and dataset for a spin or adapt it to their own work:

direct link to nb:

direct link to dataset d/l (4036 jpeg images + mc and mbb label csv files, 50MB):

download latest training weights (so far) to go straight to inference (93MB, put into your models/ folder and learn.load() in the final step before NMS):

Notebooks are early Works-In-Progress (in need of refactoring among other things) so I would appreciate any and all questions, suggestions, collaboration!

I plan to keep building on this dataset and improving models for more/better multi-object detection. The dataset is also applicable for later lessons like semantic segmentation of building and road footprints (those pixel-level annotations are also available as shp files so they need preprocessing).


btw, for those interested in learning more about this work, from :

Disasters in the South Pacific are a reality. In the past 10 years, major Cyclones have seriously affected hundreds of islands across Fiji, Tonga, Vanuatu and Samoa to name a few; disrupting millions of lives and causing millions of dollars of damage. Many of the countries in the Pacific region are also exposed to other high risk disasters including earthquakes, tsunami, storm surge, volcanic eruptions, landslides and droughts, not to mention the growing threat of Climate Change. What does all this have to do with Artificial Intelligence (AI)?

Aerial imagery is a “Big Data” challenge. We’ve observed this challenge repeatedly over the years, and most recently again during our work with UNICEF in Malawi. It took hours to manually analyze just a few hundred high-resolution aerial images from the field. In 2015, following Cyclone Pam in Vanuatu, it took days to manually analyze thousands of aerial images, which is why we eventually resorted to crowdsourcing. But this too took days because we first had to upload all the imagery to the cloud. I started working on this Big (Aerial) Data problem back in 2014 and am thrilled to dive back into this space with friends at the World Bank and OpenAerialMap (OAM). By “this space” I mean the use of machine learning and computer vision to automatically identify features of interest in aerial imagery.

Full details about the challenge and dataset:

(Jeremy Howard (Admin)) #2

Results look very encouraging!

(Dave Luo) #3

After implementing the mAP metric and ideas like more anchor boxes at smaller scales (28x28, 14x14, 7x7), fixing flatten_conv, and 1cycle training (use_clr_beta), here is the best model performance to date (& notebook link):

  • Average Precision @ IoU of 50%: .81
  • F1 score: .83
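As a rough illustration of how precision, recall, and F1 relate at a given confidence threshold, here is a small sketch. It assumes each detection has already been matched against ground truth at IoU >= 0.5 and flagged as a true or false positive; the function name and the toy numbers are hypothetical and are not taken from the notebook.

```python
def pr_f1_at_threshold(detections, n_ground_truth, threshold):
    """Precision, recall and F1 for detections kept at a confidence threshold.

    detections: list of (confidence, is_true_positive) pairs, where the
    true-positive flag comes from an earlier IoU >= 0.5 matching step.
    """
    kept = [is_tp for conf, is_tp in detections if conf >= threshold]
    tp = sum(kept)
    fp = len(kept) - tp
    fn = n_ground_truth - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: four detections scored against four labeled trees.
dets = [(0.9, True), (0.8, True), (0.6, False), (0.4, True)]
for t in (0.3, 0.5, 0.7):
    p, r, f1 = pr_f1_at_threshold(dets, n_ground_truth=4, threshold=t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Sweeping the threshold like this and taking the largest F1 gives a single "max F1" number for comparing models.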


I find the figure below a more intuitive way to understand the balance between precision and recall at different confidence thresholds (F1 score = green dotted line):


The detector pretty much only misses trees at the corners and edges of an image (where only 25-50% of a tree is in view) or where there are clusters of multiple different-sized trees. Since these are tiled aerial images, we can likely further improve performance in post-processing by stitching images and their predictions back together and removing repetitive predictions along the seams of tiles using another round of non-max suppression:
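A minimal sketch of that stitching idea, in plain Python: shift each tile's boxes into mosaic coordinates, then run greedy NMS over the combined set. This assumes neighboring tiles overlap, so the same tree is predicted in both tiles with heavily overlapping boxes; with non-overlapping tiles, two clipped half-boxes would barely intersect and would need a different merge rule. The tile origins, boxes, and scores below are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def to_mosaic(box, tile_origin):
    """Shift a tile-local box by the tile's (x, y) origin in the mosaic."""
    ox, oy = tile_origin
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


# Hypothetical predictions: (tile origin, tile-local box, confidence).
tile_preds = [
    ((0, 0), (130, 50, 220, 140), 0.9),      # tree seen in the left tile
    ((112, 0), (20, 52, 110, 142), 0.8),     # same tree seen in the right tile
    ((112, 0), (150, 150, 224, 224), 0.7),   # a different tree
]
boxes = [to_mosaic(box, origin) for origin, box, _ in tile_preds]
scores = [score for _, _, score in tile_preds]
keep = nms(boxes, scores)
print(keep)  # [0, 2]: the duplicate from the right tile is suppressed
```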

I’m testing out Feature Pyramid Network / RetinaNet-like implementations now but haven’t found performance improvement from them yet (perhaps because there’s not as big of a multi-scale problem with these top-down images taken from a consistent height - objects generally stay the same size). But FPNs will come in very handy when dealing with aerial imagery taken from different instruments with varying levels of spatial resolution.

(Rohit Singh) #4

Thanks for sharing this @daveluo. I tried this out as well though I only did what Jeremy covered in the class. My notebooks are here. You’ve added a lot of goodies on top of pascal-multi that will help me and others learn a lot more - thanks!

One of the things I’d like to try is increase the size of the image chips (from the current 224), so the network could look at larger areas in one shot. Do you have any suggestions or ideas on how large we could go and what the implications would be?

Your observation that feature pyramids might not help as much as all objects are of a similar size in satellite images is spot on. I was wondering how the network could take advantage of this fact…

(Dave Luo) #5

Thanks for checking out and working on the dataset!

Re: increasing the chip sizes, yes I think that would work fine up until we start hitting memory errors during training (you could decrease batch size to compensate) and there’s a speed trade-off. I think RetinaNet uses a range of image sizes from 400-800px (on the shorter dimension if rectangular):


One other thought is that we could vary the zoom on the larger source tiles when preprocessing and keep the input size at 224x224 to preserve speed and memory. I imagine we could zoom out as far as there’s still enough resolution to make out the details of a tree of the smallest size. I.e. if a tree is roughly 80x80 pixels at our current zoom level, we could zoom out to 4x so the same tree is now 20x20 on the same 224x224 sized chip. We would need to use smaller-sized grid cells and anchor boxes to detect them, like starting with a 28x28 grid. This should let us cover more ground at the same speed (with probably a small hit on detection performance due to less visible details per tree).

We could also vary zooms and crops more dramatically as a form of data augmentation, in which case a FPN also becomes more useful.

EDIT: please disregard my earlier comment about changing your flatten_conv function. You have it correct as is.

(Dave Luo) #6

@rohitgeo - I tried your idea of using a larger chip size. In my case, I went with 600x600, kept the input sz=600 for the model so there’s no downsampling, and decreased to bs=8 for mem reasons. I had to adjust the architecture to use a different set of grid cell sizes (because 600px strides down to 300, 150, 75, 38, 19, 10, 5, 3). Notebook link.
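The grid sizes quoted above can be reproduced with a few lines, assuming every downsampling stage is a stride-2 layer whose output size is the input size halved and rounded up (which is what a 3x3 kernel with padding 1 gives). The function name is hypothetical.

```python
def grid_sizes(input_px, n_halvings):
    """Feature-map side lengths after repeated stride-2 downsampling.

    Each stage maps n to ceil(n / 2), written here as (n + 1) // 2.
    """
    sizes = []
    size = input_px
    for _ in range(n_halvings):
        size = (size + 1) // 2
        sizes.append(size)
    return sizes


print(grid_sizes(600, 8))  # [300, 150, 75, 38, 19, 10, 5, 3]
print(grid_sizes(224, 5))  # [112, 56, 28, 14, 7]
```

With 224px input the sizes halve cleanly down to 28, 14, and 7, while 600px input hits odd sizes from 75 onward, which is why the grid cell sizes had to change.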

The detection performance is much worse:

AP: .446, max F1 score: .624

Specifically, recall suffers heavily - maxing out under 60%. I think this is because we are now trying to detect many more objects of smaller size within each image. Both localization and classification become more difficult than detecting a few “main subject” objects. I’ve seen this as well with the PASCAL dataset: i.e. detecting an airplane, train, or horse as the main subject of the photo is easier than a scattering of sheep in the distance of a pastoral landscape image.

We can see how this plays out when visualizing predictions, particularly in examples where there are clusters of many coconut trees against a background of other trees:

Lots of FNs and FPs.

It’s less of a problem where there are individual trees standing apart but still not great performance:

I haven’t tried optimizing the model and hyperparameters for 600x600 input but I imagine the root problem of having many small objects will keep detection performance down. Rather than go that route, I’d suggest making it an easier problem for the detector by choosing a chip size where you have 1-5 trees per chip vs >5 trees/chip. A benefit of most aerial imagery is that this is well within our control in preprocessing.

BTW, as a sidenote, another nuance of this particular dataset is that there are large swaths where trees are incorrectly labeled or unlabeled. See example below:


Red dots are the labeled coconut trees but there are clearly coconut trees on the right half that were missed by the labelers (I think they mislabeled these as a different type of tree). I have a preprocessing step that filters out all chips without any labeled trees - with the 224x224 chips, this had the beneficial side effect of removing most of these incorrectly labeled areas from the training set. With 600x600 chips, it’s easier for these mislabeled areas to be included in training because most chips will cover enough ground that there’s at least 1 correctly labeled tree in there. The inclusion of these chips with a lot of mislabeled trees confuses our model training and worsens performance as well.
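The filtering step could look something like the sketch below. It assumes the labels are held as a dict mapping each chip filename to its list of tree points; that structure, the function name, and the filenames are hypothetical and not the actual preprocessing code from the repo.

```python
def drop_unlabeled_chips(chip_labels, min_trees=1):
    """Keep only chips that contain at least min_trees labeled trees.

    chip_labels: dict mapping chip filename -> list of (x, y) tree points.
    """
    return {
        name: points
        for name, points in chip_labels.items()
        if len(points) >= min_trees
    }


labels = {
    "chip_0.jpg": [(10, 20)],
    "chip_1.jpg": [],                      # no labeled trees: dropped
    "chip_2.jpg": [(5, 5), (100, 120)],
}
kept = drop_unlabeled_chips(labels)
print(sorted(kept))  # ['chip_0.jpg', 'chip_2.jpg']
```

Note that this only removes chips with zero labels. A large chip with one correctly labeled tree and many missed ones still passes, which is the failure mode described above for 600x600 chips.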

(Kevin Bird) #7

Maybe these weren’t detected because of the brown around the tree. I could see that causing a model to screw up. How many coconut trees from the training dataset have dirt surrounding them? Maybe this would be a good use-case for the Online Batch Selection to help show the brown landscape more often in the training of the model.

(Rohit Singh) #8

@daveluo that’s a very thorough analysis with larger chip sizes. Thank you for trying it out and sharing your insights. Your precision-recall curve and AP, F1 score metrics make the job of comparing two models a lot easier!

In the approach shared at, the authors tried these techniques:

  • running the detector at two different scales (likely 224 and 500 pixels wide) to detect objects of varying sizes
  • upsample via a sliding window to look for small, densely packed objects - I think that’s the same as the first point above
  • denser final grid - i.e. more anchor boxes, like you did.

They didn’t get great results with closely packed objects either. The unlabeled/mislabeled trees in the dataset are probably contributing as well. I see some of those in the results too.

(hari rajeev) #9

building a model to count the number of coconuts in a tree … i know such a thing will be helpful for end users .