Here is a network from this paper: Shape-aware Instance Segmentation
As you can see, it also does instance segmentation and uses almost the same pipeline:
deep convolutional layers, then an RPN to propose the regions of interest, then bounding-box prediction, and lastly the instance segmentation.
Mask R-CNN uses a much simpler and more modular network to do the same task. Modular in the sense that the ResNeXt backbone can be swapped for a different one, and the layers of the instance segmentation branch can be added or removed without affecting the object detection and bounding-box branch.
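
To make the modularity point concrete, here is a minimal sketch in plain Python. It is not the real Mask R-CNN code: `MaskRCNNConfig`, `branches`, and all the layer names are hypothetical labels chosen for illustration, and the default layer lists are only placeholders. It just models which parts are shared and which parts belong to each branch.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class MaskRCNNConfig:
    """Hypothetical description of the network as named building blocks."""
    backbone: str = "resnext101"                       # swappable feature extractor
    box_head: Tuple[str, ...] = ("fc", "fc")           # detection + bounding-box branch
    mask_head: Tuple[str, ...] = ("conv", "conv", "conv", "conv", "upsample", "conv")


def branches(cfg: MaskRCNNConfig) -> Dict[str, Tuple[str, ...]]:
    """Return the chain of blocks that feeds each output branch."""
    # Both branches share the backbone, the RPN, and the per-region features.
    shared = (cfg.backbone, "rpn", "roi_features")
    return {"box": shared + cfg.box_head, "mask": shared + cfg.mask_head}


base = MaskRCNNConfig()

# Swap the backbone: only the first block changes, both heads stay as they were.
swapped = replace(base, backbone="resnet50")

# Add a layer to the mask branch: the box branch is left untouched.
deeper = replace(base, mask_head=base.mask_head + ("conv",))

print(branches(swapped)["box"][0])
print(branches(deeper)["box"] == branches(base)["box"])
```

The point of the sketch is that the box branch and the mask branch only meet at the shared features, so changing one head (or the backbone underneath both) does not require redesigning the other.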