By multimodal I am referring to models that take both images and text as input and are trained for tasks like classification, parsing, and translation. What is the latest research on this? At Kaggle Days, Jeff Dean mentioned he is working in this area.
Also, there is the technique that Geoffrey Hinton wrote about in the paper Distilling the Knowledge in a Neural Network. The idea is to take several different models acting as an ensemble and combine/compress them into a single model. Is there any research on this?
EDIT: Jeff Dean was talking about NAS (neural architecture search) at Kaggle Days.
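For reference, the core idea in the Hinton paper is to train the small model against the big model's temperature-softened output distribution, blended with the ordinary hard-label loss. Here is a minimal NumPy sketch; the function names and the default T/alpha values are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax: higher T gives a softer distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    """Blend of (a) cross-entropy against the teacher's softened outputs and
    (b) ordinary cross-entropy on the true hard label.
    The T*T factor keeps the soft-loss gradients on the same scale as T changes."""
    p_teacher = softmax(teacher_logits, T)
    p_student_soft = softmax(student_logits, T)
    p_student = softmax(student_logits, 1.0)
    soft_loss = -np.sum(p_teacher * np.log(p_student_soft + 1e-12)) * T * T
    hard_loss = -np.log(p_student[hard_label] + 1e-12)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

A student whose logits agree with the teacher (and with the hard label) gets a lower loss than one that disagrees, which is what the training procedure exploits.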
A cool example of distillation is described in this blog post: https://medium.com/tensorflow/introducing-bodypix-real-time-person-segmentation-in-the-browser-with-tensorflow-js-f1948126c2a0?linkId=63657671
This is a real-time semantic segmentation model for people that can detect different body parts. The actual model they end up with is very simple: just MobileNet as the base network plus an upsampling layer at the end. This is fast enough to run in real-time in the browser or on a mobile device.
The problem they faced: there is no dataset in which people's body parts are labeled as separate targets. COCO has segmentation masks, but only for people as a whole, and they didn't have time to do all that labeling by hand.
The solution: train a big ResNet101 model on a combination of real images from COCO and simulated images produced by a 3D rendering program. From the COCO images, the model learns what real people look like, but it only sees the body as a whole, not the individual parts. For the simulated images, body part labels are easy to get (a small modification to the 3D renderer), so from these the ResNet101 model learns what the different body parts look like.
They needed this combination of real and synthesized images so that the ResNet would learn both what body parts look like and what real humans look like.
Here is the trick: the trained ResNet model is now an automated way to generate body part segmentation labels for real images. So you can train a smaller model on a large dataset of human images without having ground-truth labels for them; the ResNet model provides the labels.
So finally they train the MobileNet model on real images of people, using the trained ResNet model to generate the body part labels for those images.
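That final step boils down to a very simple loop: run the teacher over the unlabeled images, keep its per-pixel argmax as the label, and train the student on those pairs with ordinary supervised updates. A sketch, where `teacher` stands in for the trained ResNet101 and `student_step` for one MobileNet gradient step (both names are placeholders, not from the post):

```python
import numpy as np

def pseudo_label_dataset(teacher, unlabeled_images):
    """Run the teacher on each unlabeled image and keep its per-pixel
    argmax prediction as the training label for the student.
    teacher(img) is assumed to return an (H, W, C) probability map."""
    return [(img, np.argmax(teacher(img), axis=-1)) for img in unlabeled_images]

def train_student(student_step, labeled_pairs, epochs=1):
    """Plain supervised training, except the labels came from the teacher."""
    for _ in range(epochs):
        for img, label in labeled_pairs:
            student_step(img, label)
```

Because the student only ever sees (image, label) pairs, it never needs to know the labels were machine-generated; any standard segmentation training loop works unchanged.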