Implementing Mask R-CNN

brendan · March 28, 2017, 8:08pm

While discussing our Semantic Transfer demo, @Even brought to my attention Mask R-CNN, a new paper from Facebook AI. A few of you have expressed interest in trying to implement this (@Matthew, @sravya8, @jeremy), so I wanted to use this thread to share our progress toward an implementation. This post is a wiki, so feel free to make updates as our understanding improves.

Here are my initial notes on the various components we need to understand and implement.

Key Links

Mask R-CNN Paper
Great survey paper of current techniques for object detection
CS231 Object Detection Slides and Video
CS231 Segmentation Slides and Video

Related Papers

The building blocks of Mask R-CNN. Papers and Githubs (I have not closely reviewed the code).

ResNet - Paper, Keras, Pytorch
ResNext - Paper, Github (Not required, but improves performance)
R-CNN - Paper
Fast R-CNN - Paper, Slides, Pycaffe, Tensorflow
Faster R-CNN - Paper, Slides, Pycaffe, Pytorch, Tensorflow
RoIPooling - Pytorch, Theano
R-FCN - Paper, Pycaffe
FPN (Feature Pyramid Network) - Paper
Spatial Transformer Networks - Paper, PyTorch, Tensorflow, Tutorial

Key Mask R-CNN Improvements

RoIAlign Layer - Improved version of RoIPool Layer
Mask Branch - Segmentation prediction on a Region of Interest, in parallel with classification/detection
Decouple Mask and Class Prediction using Binary Sigmoid Activations vs Multi-class Softmax
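
To make the last point concrete, here is a minimal sketch (plain Python, not the authors' code) of how softmax and per-class sigmoid behave on the mask logits of a single pixel. The logit values and the `predicted_class` index are made up for illustration.

```python
import math

def softmax(logits):
    """Multi-class softmax: class probabilities compete and sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Binary sigmoid: each class is scored independently of the others."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical mask logits for one pixel, one value per class (3 classes).
pixel_logits = [2.0, 1.5, -3.0]

softmax_probs = softmax(pixel_logits)
sigmoid_probs = [sigmoid(z) for z in pixel_logits]

# With softmax, class 1 is pushed below 0.5 because class 0 scores higher.
# With sigmoid, classes 0 and 1 are both above 0.5 at this pixel, and the
# classification branch decides which mask channel is actually used.
predicted_class = 1  # assumed output of the classification branch
mask_pixel_on = sigmoid_probs[predicted_class] >= 0.5
```

The point of the decoupling is that the mask branch never has to decide which class the object is; it only answers "is this pixel part of the object?" for each class separately.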

Implementation Details

Quick facts I was able to extract from a cursory review

Two part architecture

Feature Extraction (process the image and save the activations at a specific layer)
Segmentation (bounding box, class, mask prediction on a “Region of Interest”)

Feature Extraction Models They Tried

ResNet and ResNext at depths of 50 or 101. Extracted the activations at the final convolutional layer at the 4th stage: “C4”
Feature Pyramid Network (FPN) in combination with ResNet

Training Parameters

Training Set: 80K train images plus a 35K subset of val (trainval35k); MiniValidation Set (Ablations): the remaining 5K val images
160K training iterations (160K mini-batches)
Training time: 32 hours, 8 GPU machine, with ResNet-50-FPN architecture
Learning Rate 0.02 until 120K iterations, then reduced to 0.002
Single Image Segmentation time: 200ms on 1 Tesla M40 GPU
Weight decay: .0001
Momentum: 0.9
Mini-batches of 2 images per GPU
Resized inputs so the shorter edge (width/height) was 800 pixels.
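
The numbers above can be collected into a small config sketch. The values are the ones listed in this section; all names are hypothetical, and I am assuming the learning rate drops exactly at iteration 120K and that resized dimensions are rounded to the nearest pixel.

```python
# Training settings from the notes above (names are hypothetical).
NUM_GPUS = 8
IMAGES_PER_GPU = 2
BASE_LR = 0.02
LR_DROP_ITERATION = 120_000
TOTAL_ITERATIONS = 160_000
WEIGHT_DECAY = 0.0001
MOMENTUM = 0.9
SHORTER_EDGE = 800

def effective_batch_size():
    """Images per mini-batch across all GPUs."""
    return NUM_GPUS * IMAGES_PER_GPU

def learning_rate(iteration):
    """Step schedule: 0.02 until 120K iterations, then divided by 10."""
    if iteration < LR_DROP_ITERATION:
        return BASE_LR
    return BASE_LR / 10

def resized_shape(width, height, shorter_edge=SHORTER_EDGE):
    """Scale an image so its shorter edge equals shorter_edge, keeping aspect ratio."""
    scale = shorter_edge / min(width, height)
    return round(width * scale), round(height * scale)
```

With 2 images on each of 8 GPUs, every iteration sees 16 images, so 160K iterations cover roughly 2.56M image presentations.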

Loss Function

Loss Function = Loss_class + Loss_box + Loss_mask
Mask loss only considered for the ground truth label
Average Binary Cross-Entropy Loss
Per-Pixel Sigmoid Activation
Decision Boundary 0.5 (for binarizing the predicted mask)
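
A minimal sketch of the mask term as I understand it, in plain Python rather than any framework. The function names and the nested-list layout (one m x m mask of logits per class) are assumptions for illustration only.

```python
import math

def sigmoid(z):
    """Numerically stable per-pixel sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mask_loss(mask_logits, gt_class, gt_mask):
    """Average binary cross-entropy on the ground-truth class channel only.

    mask_logits: list of K masks (one per class), each an m x m list of logits.
    gt_class:    index of the ground-truth class for this RoI.
    gt_mask:     m x m list of 0/1 target values.
    """
    logits = mask_logits[gt_class]  # the other K-1 channels do not contribute
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, gt_mask):
        for z, y in zip(logit_row, target_row):
            p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count

def total_loss(loss_class, loss_box, loss_mask):
    """Loss = Loss_class + Loss_box + Loss_mask, unweighted as written above."""
    return loss_class + loss_box + loss_mask
```

Because only the `gt_class` channel is read, whatever the network predicts in the other class channels has no effect on the mask loss for that RoI.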

Dataset

COCO (Common Objects in Context)
http://mscoco.org/dataset
Paper explaining dataset
200K annotated images
80 instance categories
1.2M instances
Torrent Download?

brendan · March 28, 2017, 8:22pm

First question:
Should we implement this in Keras? Pytorch? Tensorflow? Or some combination? The authors are from Facebook, so it’s likely their implementation will be in Torch, PyTorch, or Pycaffe. The main author released his ResNext code in Torch, for example. Another released her last project in Caffe.

In light of this, perhaps it makes most sense to go the tensorflow/keras direction?

ibarinov · March 28, 2017, 8:50pm

Brendan,

There will be a study group session today, “Study Group Session: Mask R-CNN”, at the Deep Learning Study Group (San Francisco). You can sign up here https://www.meetup.com/deep-learning-sf/events/238619304/?rv=cr1&_af=event&_af_eid=238619304&https=on

brendan · March 28, 2017, 8:51pm

I did! But I got stuck on the waiting list.

ibarinov · March 28, 2017, 9:02pm

I’ve never seen a full room there, or any check-in, so you’ll be fine with the USF badge.

I’d like to use NN segmentation, e.g. Mask R-CNN, to segment sea lions in the NOAA Fisheries Steller Sea Lion Population Count.

Here is an example of TrainDotted images from Train dataset https://transfer.sh/p0K5B/0.jpg
Zoom in to see the dots.

Keys:
red: adult males
magenta: subadult males
brown: adult females
blue: juveniles
green: pups

brendan · March 28, 2017, 9:07pm

I was just looking at that competition! Maybe we can work together

But Mission Hall is UCSF? I’m completely confused.

ibarinov · March 28, 2017, 9:12pm

Oh, you are right. Plan B. I’ve asked the organizer of the meetup, Jim Fleming, if he could promote you from the wait list.

jeremy · March 28, 2017, 9:21pm

Would be great if you guys can take some good notes from the study group.

PyTorch seems to have more of the existing pieces already in place. Particularly RoiPooling. I think PyTorch is generally easier for prototyping things, and I like the torch-vision library. I’m happy to use tensorflow though - with whatever amount of keras is appropriate.

Presumably we don’t need to use resnext - that was just something they tried. So what does a minimal implementation of Mask R-CNN require?

davecg · March 28, 2017, 9:39pm

I wonder how difficult it would be to add PyTorch as an extra backend to Keras. Even with its new integration into the main TensorFlow project the plan is still to keep it backend agnostic.

brendan · March 28, 2017, 9:51pm

I’m curious to know what role segmentation would play in this project? The existing datasets don’t have Sea Lions as far as I can tell. And the sea lions look like tiny dots given the aerial photos. I wonder how far a vanilla VGG with transfer learning could go.

ImageNet has Seals, though.
http://image-net.org/challenges/LSVRC/2016/#loc

Would it require hand-annotation?

ibarinov · March 28, 2017, 9:59pm

a. “Jim Fleming 2:50 PM
Hi Igor, yep, as long as they’re on the waitlist they can attend.”

b. I thought about a pipeline like:
Segmentation -> Super Resolution on each animal-> Classification

ImageNet has a label n02077923 for Sea Lions
https://transfer.sh/VztbP/n02077923.tar (153M)

c. (just an idea) We are in California and there are a lot of sea lions around. We could take photos of sea lions from a drone and train a NN on this new category, and then use it for super resolution.

brendan · March 28, 2017, 10:04pm

Hahah love it!

The super-resolution idea is interesting and reminds me of this StackGAN paper where the authors generate small, low res images with the GAN and then post-process them with super resolution to produce crazy good looking images.

Thanks for reaching out for me! I’ll be there.

brendan · March 29, 2017, 6:08pm

I think the consensus last night was we should have read the 5 supporting papers Mask R-CNN builds on first. But I did meet the 2nd place winner of the Kaggle Whale spotting competition! Now he works at a medical imaging company using deep learning! In addition, one other participant introduced himself as “I’m here tonight because I’m interested in combining neural style transfer with object segmentation.” So I laughed at that.

To your question, “what is the bare minimum we need to implement?” I’d say we need to implement just enough to benchmark our results against theirs. And I agree we should do it in PyTorch. I also just found this WIP Tensorflow implementation we can reference.

What’s bare minimum?

To get their best result for image segmentation, we need ResNext-101 with Feature Pyramid Networks (another brand new paper). However, they provide benchmarks for Resnet-50 and Resnet-101 with either FPN or C4 features, so we can cut corners and still have something to benchmark.

In terms of object detection they actually break it down and analyze the impact of each of the components

The gains of Mask R-CNN over [21] come from using RoIAlign (+1.1 APbb), multitask training (+0.9 APbb), and ResNeXt-101 (+1.6 APbb)

Ironically, even though ResNext isn’t introduced in this paper, it generates the most improvement.

50 vs 101 Layers
Let’s use 101 since it gives +2.0 AP improvement and shouldn’t be too hard to handle. Pytorch has a nice built in ResNet model with both 50 and 101 layers supported.

ResNet vs ResNext
We can prototype with ResNet and sacrifice +1.3 AP. But later we can try this script which ports the weights AND source code of ResNext from Torch to PyTorch. They report success on porting ResNext.

C4 vs FPN Features
It’s not required, but using FPN features is critical for both accuracy and speed. According to the authors: "We report that ResNet-101-C4 takes ∼400ms (vs ~200ms for FPN) as it has a heavier box head, so we do not recommend using the C4 variant in practice."

RoIAlign
We need to implement this to properly benchmark their result.
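
Here is a rough single-channel sketch of what I understand RoIAlign to compute, in plain Python so it is easy to step through. It is not the authors' code. Assumptions: the RoI is already in feature-map coordinates, integer coordinates sit on pixel centers, each bin takes 2x2 regularly spaced samples, and the samples are averaged. The key difference from RoIPool is that the RoI coordinates and bin boundaries are never rounded.

```python
import math

def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at float (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the feature map
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature, roi, out_size=2, samples=2):
    """Average bilinear samples in each output bin; nothing is quantized.

    roi = (x1, y1, x2, y2) as floats in feature-map coordinates.
    Returns an out_size x out_size list of pooled values.
    """
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            acc = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside bin (i, j).
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    acc += bilinear(feature, y, x)
            row.append(acc / (samples * samples))
        output.append(row)
    return output

# Toy 5x5 feature map where the value encodes position: x + 10 * y.
feature_map = [[x + 10 * y for x in range(5)] for y in range(5)]
pooled = roi_align(feature_map, roi=(0.5, 0.5, 3.5, 3.5))
```

On a linear ramp like this toy map, each pooled value equals the ramp evaluated at the bin center, which makes a handy sanity check when porting this to a real PyTorch layer.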

That’s all for now. More soon.

iNLyze · March 29, 2017, 7:22pm

@brendan, thanks for bringing this up! This is going to be a blast, when you get it working. The sea lion contest is definitely one for segmentation, because of counting. I wonder how Mask R-CNN compares to other segmentation architectures. They did cite Spatial Pyramid Pooling, due to RoiPooling operation, but I didn’t find a reference to DenseNet yet.
Boy, the speed of this is blowing my mind. Again, Supernova.

davecg · March 29, 2017, 8:40pm

I actually have some code for spatial pyramid pooling in Keras using tf backend. Can share it when I’m back at my computer.

iNLyze · March 29, 2017, 8:45pm

This one or other resource?

davecg · March 29, 2017, 8:50pm

Haha, no, made my own but looks like it’s basically the same. (I didn’t include ‘th’ ordering since I wasn’t using it.)

Don’t think that was there when I wrote it.

brendan · March 29, 2017, 9:27pm

Mindi was able to secure Room 451 for us on Friday from 11am - 5pm. She couldn’t get something for 10am, but we can plan to meet on the 5th floor at 10am and then migrate to room 451 at 11.

jeremy · March 29, 2017, 10:46pm

OK, so I read the paper. And I came to the same conclusion as @brendan said the reading group did - we need to master the pieces before we can implement the paper.

So, I’m wondering if we’ve bitten off quite a bit more than we can chew in a day! I wouldn’t want to get to the end of the day (plus, perhaps, the weekend) and find that we hadn’t achieved anything we were excited about. So I can think of two approaches:

Start with something simpler at the start of the day, and make sure we can at least finish that. Specifically, I’m thinking The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation
Or, start with one sub-piece of the full mask R-CNN, such as the region proposal network (RPN)

There’s already a faster R-CNN implementation (and therefore RPN) at https://github.com/yhenon/keras-frcnn and one for pytorch at https://github.com/longcw/faster_rcnn_pytorch . So I’m less excited about this.

OTOH, I’m really excited about doing the tiramisu! Here’s my pitch:

There are already 2 densenet pytorch implementations, and they both look pretty good, but there’s no tiramisu implementation
I’ve been hoping to teach the use of skip connections in segmentation in this part of the course, so this would give me just what I need
As well as teaching tiramisu, this would be a great excuse to teach densenet too!
The results from densenet and tiramisu are both state of the art, yet are easy to understand
@yad.faeq has already started an (incomplete) keras port, so maybe he can help us too…

If we get it done, we could then move on to parts of mask r-cnn later in the day or over the weekend, if there’s interest.

What do you all think?

brendan · March 29, 2017, 11:22pm

The more I read the more I agree with you. I’m partial to implementing a single feature of Mask R-CNN however because the architecture allows for instance segmentation, which helps our video style transfer. Tiramisu provides pixel-by-pixel semantic segmentation, so all cats would be lava red ;).

Tiramisu doesn’t report results on MS COCO, so it’s hard to compare accuracy. I wasn’t able to find anything about training or prediction times either. Last night I asked about Tiramisu and the event organizer told me he wasn’t able to replicate their results (he had difficulty converging). He mentioned their model was much harder to train. This is anecdotal, so I can’t confirm. How do DenseNets compare in terms of trainability?

I propose we take the existing Faster R-CNN PyTorch implementation and add a single feature on top. For example, the authors of Mask R-CNN provide benchmarks for Faster R-CNN with RoIAlign only, so we could focus on only implementing RoIAlign? This way we start with a working model (our unit test) and keep breaking it until we get it to work again.

However, I have a very cursory understanding of these things, so I’ll leave it to you and the others to decide.