I’ve just released this article, https://blog.floydhub.com/knowledge-distillation/, on using knowledge distillation to reach roughly 88.4% accuracy on the SST-2 sentiment analysis dataset with a tiny bidirectional LSTM model. For reference, BERT trained on the same dataset gets 93%.
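For anyone new to the idea, here is a rough sketch of one common distillation objective: a weighted mix of the usual cross-entropy on the true label and a temperature-softened KL term that pulls the student's outputs toward the teacher's. This is only an illustration of the general technique, not necessarily the exact loss used in the article or the repo. The names `distillation_loss`, `temperature` and `alpha`, and their default values, are assumptions made up for this example. It's plain Python so it runs anywhere; in practice you'd write the same thing with your framework's tensor ops.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Illustrative soft-target distillation loss for a single example.

    alpha weights the hard-label term; (1 - alpha) weights the soft term.
    Both defaults are arbitrary placeholders, not tuned values.
    """
    # Hard-label term: cross-entropy between the student and the true class.
    hard_loss = -log_softmax(student_logits)[label]

    # Soft term: KL(teacher || student) on temperature-softened distributions.
    teacher_log = log_softmax(teacher_logits, temperature)
    student_log = log_softmax(student_logits, temperature)
    soft_loss = sum(math.exp(t) * (t - s)
                    for t, s in zip(teacher_log, student_log))

    # The temperature**2 factor keeps the soft term's gradient scale
    # comparable across different temperatures.
    return alpha * hard_loss + (1 - alpha) * temperature ** 2 * soft_loss


# Binary sentiment example: the teacher is confident in class 0.
teacher = [2.0, -1.0]
print(distillation_loss([2.0, -1.0], teacher, label=0))  # student agrees
print(distillation_loss([0.0, 3.0], teacher, label=0))   # student disagrees
```

The second call should print a noticeably larger loss than the first, since the student both misses the label and disagrees with the teacher.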
This method can potentially be used to run models directly on low-powered devices like smartphones for lower latency, or to greatly reduce the load on APIs running neural networks, at the expense of relatively little accuracy.
With the method in the article, the resulting model is so small that a single core of my laptop’s CPU could run inference more than 500 times per second on single sentences (batch size 1, sentence lengths typical of SST-2).
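If you want to measure this kind of number yourself, a minimal timing harness could look something like the sketch below. It is not the script behind the figure above. `sentences_per_second` and `dummy_predict` are hypothetical names, and `dummy_predict` is just a stand-in that you would replace with a call to your own model on one tokenized sentence. With PyTorch you would probably also want to call `torch.set_num_threads(1)` and run the model under `torch.no_grad()` to get a single-core, inference-only measurement.

```python
import time


def sentences_per_second(predict, sentence, warmup=10, iterations=200):
    """Time repeated single-sentence calls (batch size 1) to `predict`."""
    # Warm-up calls so one-time setup costs don't skew the measurement.
    for _ in range(warmup):
        predict(sentence)

    start = time.perf_counter()
    for _ in range(iterations):
        predict(sentence)
    elapsed = time.perf_counter() - start
    return iterations / elapsed


def dummy_predict(tokens):
    """Hypothetical stand-in for the student model's forward pass."""
    return sum(len(token) for token in tokens) % 2


# A short example sentence, roughly the length of a typical movie-review snippet.
sentence = "a charming and often affecting journey".split()
rate = sentences_per_second(dummy_predict, sentence)
print(f"{rate:.0f} sentences/second")
```

The printed number depends entirely on the machine and on what `predict` does, so treat it as a relative measure rather than something comparable across setups.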
If you have any questions or insights, please feel free to share them.
Here’s the code: https://github.com/tacchinotacchi/distil-bilstm
P.S. Training the student model with fastai v2 bumps the accuracy to 89%. I may update the article if I find a way to implement the entire workflow in fastai.