Open questions about knowledge distillation

Hello forum!

I’ve been working with transfer learning in the produce domain and have run into a lot of questions without clear answers. I’d like to write my questions here and invite everyone to comment and share any ideas or information on this topic.

  1. Knowledge distillation: If you are training an image classification model whose penultimate layer serves as an embedding, do you connect only the final classification layer to the teacher's outputs, or can you also compute a loss directly on the embedding layer? What are the most sensible approaches, before one goes brute-force testing every possible variant? (I expect I will eventually try all variants, but I'd like to start from the most promising one, ship a first model to production, and then work on improvements; see the sketch below.) In my case the teacher is ResNet-34 or VGG-16 and the small model is MobileNet-v1 or -v2.
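
To make the two options concrete, here is a minimal sketch of a combined distillation loss. It assumes PyTorch and torchvision (my assumption, not stated in the post) with a ResNet-34 teacher and a MobileNetV2 student; the `proj` layer and the weights `T`, `alpha`, `beta` are hypothetical choices for illustration. It shows both variants at once: a Hinton-style soft-label term on the classifier logits and a FitNets-style "hint" term that matches the penultimate embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

# Teacher: ResNet-34 with the final fc layer detached, so that
# teacher(x) returns the 512-d penultimate embedding. In practice you
# would load your trained teacher weights here.
teacher = models.resnet34()
teacher_fc = teacher.fc          # keep the original classifier head
teacher.fc = nn.Identity()       # teacher(x) -> (B, 512) embedding
teacher.eval()

# Student: MobileNetV2. Its penultimate embedding is 1280-d, so a
# hypothetical linear projection maps it into the teacher's 512-d
# space for the embedding-matching loss.
student = models.mobilenet_v2(num_classes=1000)
proj = nn.Linear(1280, 512)

T = 4.0        # softmax temperature for the logit-matching term
alpha = 0.5    # weight on the soft-label (KD) term
beta = 0.1     # weight on the embedding-matching term

def distillation_loss(x, labels):
    with torch.no_grad():
        t_emb = teacher(x)               # (B, 512) teacher embedding
        t_logits = teacher_fc(t_emb)     # (B, 1000) teacher logits

    s_emb = student.features(x)          # (B, 1280, H, W) feature map
    s_emb = s_emb.mean(dim=[2, 3])       # global average pool -> (B, 1280)
    s_logits = student.classifier(s_emb) # (B, 1000) student logits

    # 1) Hard-label cross-entropy against the ground truth.
    ce = F.cross_entropy(s_logits, labels)

    # 2) Soft-label KL divergence on temperature-scaled logits
    #    (classic logit distillation; T**2 restores gradient scale).
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=1),
        F.softmax(t_logits / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)

    # 3) Embedding-matching ("hint") loss on the penultimate layer.
    feat = F.mse_loss(proj(s_emb), t_emb)

    return ce + alpha * kd + beta * feat
```

If you go this route, the optimizer should cover `list(student.parameters()) + list(proj.parameters())`; setting `beta = 0` recovers plain logit-only distillation, so the same code lets you compare both variants.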