Implementing Mask R-CNN

While discussing our Semantic Transfer demo, @Even brought to my attention Mask R-CNN, a new paper from Facebook AI. A few of you have expressed interest in trying to implement this (@Matthew, @sravya8, @jeremy), so I wanted to use this thread to share our progress toward an implementation. This post is a wiki, so feel free to make updates as our understanding improves.

Here are my initial notes on the various components we need to understand and implement.

Key Links

Related Papers

The building blocks of Mask R-CNN: papers and GitHub repos (I have not closely reviewed the code).

Key Mask R-CNN Improvements

  • RoIAlign Layer - Improved version of RoIPool Layer
  • Mask Branch - Segmentation prediction on a Region of Interest, in parallel with classification/detection
  • Decouple Mask and Class Prediction using Binary Sigmoid Activations vs Multi-class Softmax
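To make the RoIAlign point concrete, here is a toy 1-D sketch (all names are mine, not from the paper or any release): RoIPool snaps fractional RoI coordinates to the integer grid before pooling, while RoIAlign samples the feature map at exact fractional positions using bilinear (here, linear) interpolation.

```python
# Toy 1-D illustration of RoIPool vs RoIAlign (illustrative names only).

def linear_sample(feat, x):
    """Sample a 1-D feature list at fractional coordinate x (linear interp)."""
    x0 = int(x)                       # left integer neighbour
    x1 = min(x0 + 1, len(feat) - 1)   # right integer neighbour
    w = x - x0                        # fractional weight
    return feat[x0] * (1 - w) + feat[x1] * w

feat = [0.0, 1.0, 2.0, 3.0, 4.0]

# RoIPool: quantize coordinate 1.5 to the integer grid -> reads feat[1]
roipool_val = feat[int(1.5)]              # 1.0, sub-pixel offset lost

# RoIAlign: interpolate at 1.5 exactly -> recovers the in-between value
roialign_val = linear_sample(feat, 1.5)   # 1.5, alignment preserved
```

In 2-D the same idea uses four integer neighbours per sample point, with several sample points averaged per output bin, so no quantization happens anywhere along the way.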

Implementation Details

Quick facts I was able to extract from a cursory review:

Two part architecture

  1. Feature Extraction (process the image and save the activations at a specific layer)
  2. Segmentation (bounding box, class, mask prediction on a “Region of Interest”)

Feature Extraction Models They Tried

  • ResNet and ResNeXt at depths of 50 or 101. Extracted the activations at the final convolutional layer of the 4th stage: “C4”
  • Feature Pyramid Network (FPN) in combination with ResNet

Training Parameters

  • Training Set: 80K, Validation Set: 35K, MiniValidation Set (Ablations): 5K
  • 160K training iterations (160K mini-batches)
  • Training time: 32 hours, 8 GPU machine, with ResNet-50-FPN architecture
  • Learning Rate 0.02 until 120K iterations, then reduced to 0.002
  • Single Image Segmentation time: 200ms on 1 Tesla M40 GPU
  • Weight decay: 0.0001
  • Momentum: 0.9
  • Mini-batches of 2 images per GPU
  • Resized inputs so the shortest edge (width/height) was 800 pixels.
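Those numbers boil down to a simple step schedule (SGD with momentum 0.9 and weight decay 0.0001); a minimal sketch, with the function name mine:

```python
# LR 0.02 for the first 120K iterations, then dropped 10x to 0.002
# for the remainder of the 160K-iteration run.

def learning_rate(iteration, base_lr=0.02, decay_at=120_000, gamma=0.1):
    """Step learning-rate schedule matching the reported hyperparameters."""
    return base_lr * gamma if iteration >= decay_at else base_lr
```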

Loss Function

  • Loss Function = Loss_class + Loss_box + Loss_mask
  • Mask loss only considered for the ground truth label
  • Average Binary Cross-Entropy Loss
  • Per-Pixel Sigmoid Activation
  • Decision Boundary 0.5 (for class prediction)
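To make the decoupling concrete, here is a toy version of the mask loss (pure Python, illustrative names only): the network predicts one mask per class, but only the ground-truth class’s mask is penalized, using a per-pixel sigmoid and averaged binary cross-entropy.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mask_loss(logits_per_class, gt_class, gt_mask):
    """Average per-pixel BCE on the ground-truth class's mask only.

    logits_per_class: {class_id: flat list of per-pixel mask logits}
    gt_mask: flat list of 0/1 ground-truth pixel labels
    """
    logits = logits_per_class[gt_class]  # other classes don't contribute
    total = 0.0
    for z, y in zip(logits, gt_mask):
        p = sigmoid(z)                   # per-pixel sigmoid, not a softmax
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)
```

With zero logits every pixel sits at p = 0.5 (the decision boundary), giving a loss of log 2 per pixel; because classes never compete in a softmax, the mask and class predictions stay decoupled.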


COCO (Common Objects in Context)

  • Paper explaining the dataset
  • 200K annotated images
  • 80 instance categories
  • 1.2M instances
  • Torrent download?


First question:
Should we implement this in Keras? PyTorch? TensorFlow? Or some combination? The authors are from Facebook, so it’s likely their implementation will be in Torch, PyTorch, or PyCaffe. The main author released his ResNeXt code in Torch, for example. Another released her last project in Caffe.

In light of this, perhaps it makes the most sense to go the TensorFlow/Keras direction?


Today there will be a study group session, “Study Group Session: Mask R-CNN”, at the Deep Learning Study Group (San Francisco). You can sign up here.

I did! But I got stuck on the waiting list. :frowning:

I never saw a full room or a strict check-in there, so you’ll be good with the USF badge.

I’d like to use NN segmentation, e.g. Mask R-CNN, to segment sea lions in the NOAA Fisheries Steller Sea Lion Population Count competition.

Here is an example of the TrainDotted images from the Train dataset. Zoom in to check the dots.

red: adult males
magenta: subadult males
brown: adult females
blue: juveniles
green: pups


I was just looking at that competition! Maybe we can work together :slight_smile:

But Mission Hall is UCSF? I’m completely confused.

Oh, you are right. Plan B: I’ve asked the organizer of the meetup, Jim Fleming, if he could promote you from the waitlist.


Would be great if you guys can take some good notes from the study group.

PyTorch seems to have more of the existing pieces already in place. Particularly RoiPooling. I think PyTorch is generally easier for prototyping things, and I like the torch-vision library. I’m happy to use tensorflow though - with whatever amount of keras is appropriate.

Presumably we don’t need to use ResNeXt - that was just something they tried. So what does a minimal implementation of Mask R-CNN require?


I wonder how difficult it would be to add PyTorch as an extra backend to Keras. Even with its new integration into the main TensorFlow project, the plan is still to keep it backend agnostic.

I’m curious what role segmentation would play in this project. The existing datasets don’t have sea lions as far as I can tell. And the sea lions look like tiny dots given the aerial photos. I wonder how far a vanilla VGG with transfer learning could go.

ImageNet has Seals, though.

Would it require hand-annotation?

a. “Jim Fleming 2:50 PM
Hi Igor, yep, as long as they’re on the waitlist they can attend.”

b. I thought about pipeline like:
Segmentation -> Super Resolution on each animal-> Classification

ImageNet has a label n02077923 for Sea Lions (153M)

c. (Just an idea) We are in California and there are a lot of sea lions around. We could take photos of sea lions from a drone and train a NN on this new category. And use it for super resolution afterwards.


Hahah love it!

The super-resolution idea is interesting and reminds me of this StackGAN paper where the authors generate small, low res images with the GAN and then post-process them with super resolution to produce crazy good looking images.

Thanks for reaching out on my behalf! I’ll be there.


I think the consensus last night was that we should read the 5 supporting papers Mask R-CNN builds on first. But I did meet the 2nd place winner of the Kaggle whale spotting competition! Now he works at a medical imaging company using deep learning! In addition, one other participant introduced himself with “I’m here tonight because I’m interested in combining neural style transfer with object segmentation.” So I laughed at that.

To your question, “what is the bare minimum we need to implement?” I’d say we need to implement just enough to benchmark our results against theirs. And I agree we should do it in PyTorch. I also just found this WIP Tensorflow implementation we can reference.

What’s bare minimum?

To get their best result for image segmentation, we need ResNeXt-101 with Feature Pyramid Networks (another brand-new paper). However, they provide benchmarks for ResNet-50 and ResNet-101 using either FPN or C4 features, so we can cut corners and still have something to benchmark against.

In terms of object detection, they actually break it down and analyze the impact of each of the components:

The gains of Mask R-CNN over [21] come from using RoIAlign (+1.1 APbb), multitask training (+0.9 APbb), and ResNeXt-101 (+1.6 APbb)

Ironically, even though ResNeXt isn’t introduced in this paper, it generates the biggest improvement.

50 vs 101 Layers
Let’s use 101, since it gives a +2.0 AP improvement and shouldn’t be too hard to handle. PyTorch has a nice built-in ResNet model with both 50 and 101 layers supported.

ResNet vs ResNeXt
We can prototype with ResNet and sacrifice +1.3 AP. Later we can try this script, which ports the weights AND source code of ResNeXt from Torch to PyTorch. They report success porting ResNeXt.

C4 vs FPN Features
It’s not required, but using FPN features is critical for both accuracy and speed. According to the authors: "We report that ResNet-101-C4 takes ∼400ms (vs ~200ms for FPN) as it has a heavier box head, so we do not recommend using the C4 variant in practice."

We need to implement this to properly benchmark their result.

That’s all for now. More soon.


@brendan, thanks for bringing this up! This is going to be a blast when you get it working. The sea lion contest is definitely one for segmentation, because of the counting. I wonder how Mask R-CNN compares to other segmentation architectures. They did cite Spatial Pyramid Pooling, due to the RoiPooling operation, but I haven’t found a reference to DenseNet yet.
Boy, the speed of this is blowing my mind. Again, Supernova.

I actually have some code for spatial pyramid pooling in Keras using tf backend. Can share it when I’m back at my computer.

This one or other resource?


Haha, no, made my own but looks like it’s basically the same. (I didn’t include ‘th’ ordering since I wasn’t using it.)

Don’t think that was there when I wrote it. :slight_smile:

Mindi was able to secure Room 451 for us on Friday from 11am - 5pm. She couldn’t get something for 10am, but we can plan to meet on the 5th floor at 10am and then migrate to room 451 at 11.


OK, so I read the paper. And I came to the same conclusion as @brendan said the reading group did - we need to master the pieces before we can implement the paper.

So, I’m wondering if we’ve bitten off quite a bit more than we can chew in a day! I wouldn’t want to get to the end of the day (plus, perhaps, the weekend) and find that we hadn’t achieved anything we were excited about. So I can think of two approaches:

There’s already a faster R-CNN implementation (and therefore RPN) at and one for pytorch at . So I’m less excited about this.

OTOH, I’m really excited about doing the tiramisu! Here’s my pitch:

  • There are already two DenseNet PyTorch implementations, and they both look pretty good, but there’s no Tiramisu implementation
  • I’ve been hoping to teach the use of skip connections in segmentation in this part of the course, so this would give me just what I need
  • As well as teaching tiramisu, this would be a great excuse to teach densenet too!
  • The results from densenet and tiramisu are both state of the art, yet are easy to understand
  • @yad.faeq has already started an (incomplete) keras port, so maybe he can help us too…

If we get it done, we could then move on to parts of mask r-cnn later in the day or over the weekend, if there’s interest.

What do you all think?


The more I read, the more I agree with you. I’m partial to implementing a single feature of Mask R-CNN, however, because the architecture allows for instance segmentation, which helps our video style transfer. Tiramisu provides pixel-by-pixel semantic segmentation, so all cats would be lava red ;).

Tiramisu doesn’t report results on MS COCO, so it’s hard to compare accuracy. I wasn’t able to find anything about training or prediction times either. Last night I asked about Tiramisu, and the event organizer told me he wasn’t able to replicate their results (he had difficulty getting it to converge). He mentioned their model was much harder to train. This is anecdotal, so I can’t confirm it. How do DenseNets compare in terms of trainability?

I propose we take the existing Faster R-CNN PyTorch implementation and add a single feature on top. For example, the authors of Mask R-CNN provide benchmarks for Faster R-CNN with RoIAlign only, so we could focus on implementing just RoIAlign. This way we start with a working model (our unit test) and keep breaking it until we get it to work again.

However, I have a very cursory understanding of these things, so I’ll leave it to you and the others to decide.