Implementing Mask R-CNN

Would be great if you guys can take some good notes from the study group.

PyTorch seems to have more of the existing pieces already in place, particularly RoIPooling. I think PyTorch is generally easier for prototyping, and I like the torch-vision library. I’m happy to use TensorFlow though, with whatever amount of Keras is appropriate.

Presumably we don’t need to use ResNeXt - that was just something they tried. So what does a minimal implementation of Mask R-CNN require?

1 Like

I wonder how difficult it would be to add PyTorch as an extra backend to Keras. Even with its new integration into the main TensorFlow project, the plan is still to keep it backend agnostic.

I’m curious what role segmentation would play in this project. The existing datasets don’t have sea lions as far as I can tell. And the sea lions look like tiny dots in the aerial photos. I wonder how far a vanilla VGG with transfer learning could go.

ImageNet has Seals, though.
http://image-net.org/challenges/LSVRC/2016/#loc

Would it require hand-annotation?

a. “Jim Fleming 2:50 PM
Hi Igor, yep, as long as they’re on the waitlist they can attend.”

b. I thought about a pipeline like:
Segmentation -> Super Resolution on each animal -> Classification

ImageNet has a label n02077923 for Sea Lions
https://transfer.sh/VztbP/n02077923.tar (153M)

c. (just an idea) We’re in California and there are a lot of sea lions around. We could take photos of sea lions from a drone and train a NN on this new category, then use it for super resolution afterwards.

1 Like

Hahah love it!

The super-resolution idea is interesting and reminds me of this StackGAN paper where the authors generate small, low res images with the GAN and then post-process them with super resolution to produce crazy good looking images.

Thanks for reaching out for me! I’ll be there.

4 Likes

I think the consensus last night was that we should have read the 5 supporting papers Mask R-CNN builds on first. But I did meet the 2nd-place winner of the Kaggle whale-spotting competition! He now works at a medical imaging company using deep learning! In addition, another participant introduced himself with “I’m here tonight because I’m interested in combining neural style transfer with object segmentation,” so I laughed at that.

To your question, “what is the bare minimum we need to implement?” I’d say we need to implement just enough to benchmark our results against theirs. And I agree we should do it in PyTorch. I also just found this WIP Tensorflow implementation we can reference.

What’s bare minimum?

To get their best result for image segmentation, we need ResNeXt-101 with Feature Pyramid Networks (another brand-new paper). However, they provide benchmarks for ResNet-50 and ResNet-101 using either FPN or C4 features, so we can cut corners and still have something to benchmark against.

In terms of object detection, they actually break it down and analyze the impact of each component:

The gains of Mask R-CNN over [21] come from using RoIAlign (+1.1 APbb), multitask training (+0.9 APbb), and ResNeXt-101 (+1.6 APbb)

Even though ResNeXt isn’t introduced in this paper, it ironically generates the largest improvement.

50 vs 101 Layers
Let’s use 101, since it gives a +2.0 AP improvement and shouldn’t be too hard to handle. PyTorch has nice built-in ResNet models with both 50 and 101 layers supported.

ResNet vs ResNeXt
We can prototype with ResNet and sacrifice +1.3 AP. Later we can try this script, which ports the weights AND source code of ResNeXt from Torch to PyTorch. They report success porting ResNeXt.

C4 vs FPN Features
It’s not required, but using FPN features is critical for both accuracy and speed. According to the authors: “We report that ResNet-101-C4 takes ∼400ms (vs ~200ms for FPN) as it has a heavier box head, so we do not recommend using the C4 variant in practice.”

RoIAlign
We need to implement this to properly benchmark their result.
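To make the difference from RoIPool concrete, here’s a rough NumPy sketch of the core idea as I understand it from the paper: instead of quantizing the RoI boundaries to the feature-map grid, RoIAlign samples each output bin at exact fractional locations using bilinear interpolation. (This is a toy version with one sample per bin; the paper samples four points per bin and averages.)

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate feat (H x W) at fractional coords (y, x)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) +
            feat[y1, x1] * dy * dx)

def roi_align(feat, box, out_h, out_w):
    """box = (y1, x1, y2, x2) in feature-map coords, never rounded to ints."""
    y1, x1, y2, x2 = box
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # sample the center of each bin at its exact fractional position
            cy = y1 + (i + 0.5) * bin_h
            cx = x1 + (j + 0.5) * bin_w
            out[i, j] = bilinear(feat, cy, cx)
    return out
```

The key point is that no coordinate is ever rounded, so the pooled features stay pixel-aligned with the RoI, which is what the paper credits for the mask-quality gains.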

That’s all for now. More soon.

4 Likes

@brendan, thanks for bringing this up! This is going to be a blast when you get it working. The sea lion contest is definitely one for segmentation, because of the counting. I wonder how Mask R-CNN compares to other segmentation architectures. They did cite Spatial Pyramid Pooling, because of the RoIPooling operation, but I haven’t found a reference to DenseNet yet.
Boy, the speed of this is blowing my mind. Again, Supernova.

I actually have some code for spatial pyramid pooling in Keras using tf backend. Can share it when I’m back at my computer.

This one, or another resource?

1 Like

Haha, no, made my own but looks like it’s basically the same. (I didn’t include ‘th’ ordering since I wasn’t using it.)

Don’t think that was there when I wrote it. :slight_smile:

Mindi was able to secure Room 451 for us on Friday from 11am - 5pm. She couldn’t get something for 10am, but we can plan to meet on the 5th floor at 10am and then migrate to room 451 at 11.

2 Likes

OK, so I read the paper. And I came to the same conclusion as @brendan said the reading group did - we need to master the pieces before we can implement the paper.

So, I’m wondering if we’ve bitten off quite a bit more than we can chew in a day! I wouldn’t want to get to the end of the day (plus, perhaps, the weekend) and find that we hadn’t achieved anything we were excited about. So I can think of two approaches:

There’s already a faster R-CNN implementation (and therefore RPN) at https://github.com/yhenon/keras-frcnn and one for pytorch at https://github.com/longcw/faster_rcnn_pytorch . So I’m less excited about this.

OTOH, I’m really excited about doing the tiramisu! Here’s my pitch:

  • There are already 2 densenet pytorch implementations, and they both look pretty good, but there’s no tiramisu implementation
  • I’ve been hoping to teach the use of skip connections in segmentation in this part of the course, so this would give me just what I need
  • As well as teaching tiramisu, this would be a great excuse to teach densenet too!
  • The results from densenet and tiramisu are both state of the art, yet are easy to understand
  • @yad.faeq has already started an (incomplete) keras port, so maybe he can help us too…
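To illustrate the skip-connection idea from the pitch above, here’s a bare-bones PyTorch encoder-decoder (the names and layer sizes are mine, not from either paper): the decoder upsamples and concatenates the matching encoder feature map, which is the pattern Tiramisu builds its dense blocks around.

```python
import torch
import torch.nn as nn

class TinySkipSegNet(nn.Module):
    """Minimal encoder-decoder with one skip connection (illustrative only)."""
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # the decoder sees upsampled features concatenated with the skip
        self.dec = nn.Conv2d(16 + 16, n_classes, 3, padding=1)

    def forward(self, x):
        s = self.enc1(x)                   # full-resolution features (the skip)
        h = self.enc2(self.down(s))        # downsampled path
        h = self.up(h)                     # back to full resolution
        return self.dec(torch.cat([h, s], dim=1))  # per-pixel class scores

net = TinySkipSegNet()
out = net(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 2, 64, 64])
```

The skip lets the decoder recover fine spatial detail that the downsampling path throws away, which is exactly why it matters for per-pixel segmentation.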

If we get it done, we could then move on to parts of mask r-cnn later in the day or over the weekend, if there’s interest.

What do you all think?

4 Likes

The more I read, the more I agree with you. However, I’m partial to implementing a single feature of Mask R-CNN, because its architecture allows for instance segmentation, which helps our video style transfer. Tiramisu provides pixel-by-pixel semantic segmentation, so all cats would be lava red ;).

Tiramisu doesn’t report results on MSCOCO, so it’s hard to compare accuracy. I wasn’t able to find anything about training or prediction times either. Last night I asked about Tiramisu, and the event organizer told me he wasn’t able to replicate their results (he had difficulty converging). He mentioned their model was much harder to train. This is anecdotal, so I can’t confirm. How do DenseNets compare in terms of trainability?

I propose we take the existing Faster R-CNN PyTorch implementation and add a single feature on top. For example, the authors of Mask R-CNN provide benchmarks for Faster R-CNN with RoIAlign only, so we could focus on implementing just RoIAlign? This way we start with a working model (our unit test) and keep breaking it until we get it to work again.

However, I have a very cursory understanding of these things, so I’ll leave it to you and the others to decide.

2 Likes

OK sounds like we need to do both! Let’s see how many people are around, and we can either go in parallel or serial.

I don’t think I can teach something where we just implement something on top of a black box, so I’d want to fully understand what we’re building on and be able to reimplement it! :slight_smile:

3 Likes

Looking at the Faster R-CNN implementations (PyTorch, TensorFlow, Caffe), they only include bounding-box predictions, unlike what the paper describes. I don’t think anyone has coded the segmentation part? In that case instance segmentation would also have to be implemented, which adds to the difficulty. https://github.com/rbgirshick/py-faster-rcnn/issues/327

Considering the above, let’s call Friday an “implementation day” and let people work on something they’re interested in. We can use each other for support, but work on different projects if desired.

What do you think?

Other ideas to implement:

1 Like

Personally I think it would be more fun to have no more than 2 projects, so that we can collaborate more.

Sounds good to me. Right now we have @Matthew @sravya8 @kelvin and myself joining.

I’ll scope out Mask R-CNN tomorrow and see if I can identify a small component we can carve out and implement. Something chewable :slight_smile:

Okay long post incoming :slight_smile:

I just realized that I met @brendan yesterday while explaining the paper! (Nice to meet/e-meet you; we should talk next time, this time I had to leave early.)

  1. @jeremy: Yes, I can complete the 100 Layers Tiramisu and will hopefully fix it this weekend. Two big issues:
  • Theano has recursion-limit issues, so I will switch to TensorFlow.
  • Building a ConvNet with 103 layers often takes about 10 minutes to debug :confused:
    I reviewed that paper a while back; personal notes here and code
  2. For Mask R-CNN, I’m about to post my notes tomorrow, since I went through all 8 of the other papers, from the one introducing R-CNN >> Fast R-CNN, Faster R-CNN >> Faster R-CNN + Pyramid >> Faster R-CNN for real-time object detection.
    Also, for the code, I have an incomplete implementation of that model too (just not open source yet). I’m reaching out to the authors to explain to me how RoIAlign actually works and gets computed. (As I was saying to the rest of the ~30 people at the study group, leveling pixel-to-pixel segmentation with a parallel ConvNet doesn’t seem to add up: the FCN or FPN have a different loss + input style from the ConvNet running in parallel.)

In terms of the theoretical understanding, let me know if you have questions:
There isn’t actually much that these two, Tiramisu and Mask R-CNN, have changed from the previous work, except a re-arrangement of tasks + network overlays.

  • The 100 Tiramisu Layers, aka FC-DenseNets, learn very, very low-level features in the manner of an Encode-Decode model, where the Decode side mainly uses DeconvNets; most of the previous work used mostly Pooling + Subsampling alone.

  • For Mask R-CNN: the crème de la crème is the 3rd masking task being run in parallel with object detection as a classification problem and bounding-box prediction as a regression problem. Meanwhile, the 3rd branch, the task known here as masking, goes through another series of ConvNets and then makes its mind up. (The rest of the details can be distracting, but that’s one main way to arrange the information in the series.)
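The three parallel branches described above can be sketched roughly like this. This is a toy head over a shared RoI-pooled feature; the channel counts and layer sizes are placeholders of mine, not the paper’s:

```python
import torch
import torch.nn as nn

class ToyMaskRCNNHead(nn.Module):
    """Class / box / mask branches over one RoI feature (illustrative only)."""
    def __init__(self, in_ch=256, n_classes=81, roi=7):
        super().__init__()
        flat = in_ch * roi * roi
        self.cls = nn.Linear(flat, n_classes)        # classification branch
        self.box = nn.Linear(flat, n_classes * 4)    # box-regression branch
        self.mask = nn.Sequential(                   # small FCN mask branch
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, n_classes, 1),            # one mask per class
        )

    def forward(self, roi_feat):           # roi_feat: (N, in_ch, roi, roi)
        flat = roi_feat.flatten(1)
        return self.cls(flat), self.box(flat), self.mask(roi_feat)

head = ToyMaskRCNNHead()
c, b, m = head(torch.randn(4, 256, 7, 7))
print(c.shape, b.shape, m.shape)  # (4, 81), (4, 324), (4, 81, 14, 14)
```

The structural point is simply that all three heads consume the same RoI feature and train jointly, which is the multi-task setup the paper credits part of its gains to.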

1 Like

I notice that the Mask R-CNN paper doesn’t mention the FC-DenseNet paper, and that the FC-DenseNet authors didn’t test their architecture on MSCOCO (although they mentioned MSCOCO in the paper). Is this because Mask R-CNN and MSCOCO are about instance segmentation while FC-DenseNets only do semantic segmentation?

@Matthew

  1. Mask R-CNN focuses on the COCO challenge here, which essentially follows certain criteria for measuring error rate per the given tasks. Meanwhile, the FC-DenseNet (100 Layers Tiramisu) paper focuses on urban-scene benchmark datasets such as CamVid and Gatech.

  2. The MSCOCO dataset itself can be used for a wide range of challenges and is often introduced as a benchmark alongside other datasets (if the paper is not about the COCO challenge).

Doing semantic segmentation can be a prior task to doing instance segmentation, since finding an instance of a class (as in dog 1, dog 2, … dog n) in the picture requires some level of semantic understanding of the image.
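One cheap way to see the gap between the two tasks: given a semantic mask, you can recover instances of a class only when they don’t touch, e.g. by connected-component labeling. The sketch below uses `scipy.ndimage.label` on a made-up mask; real instance segmentation has to separate overlapping animals, which this can’t do:

```python
import numpy as np
from scipy import ndimage

# A toy semantic mask: 1 = "sea lion", 0 = background. Two separate blobs.
mask = np.zeros((8, 8), dtype=int)
mask[1:3, 1:3] = 1
mask[5:7, 4:7] = 1

# Connected components turn the single semantic class into instances.
labels, n_instances = ndimage.label(mask)
print(n_instances)  # 2 -- and counting instances is exactly the sea lion task
```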

After all, Mask R-CNN is evidence that, within a simpler model, one task can help the other out to accomplish the full job.