Achieving performance close to BERT on sentiment analysis with a model 57 times smaller

Hello everyone,

I’ve just published an article on using knowledge distillation to reach about 88.4% accuracy on the SST-2 sentiment analysis dataset with a tiny bidirectional LSTM model. For reference, BERT trained on the same dataset gets about 93%.
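
For anyone new to knowledge distillation, here is a minimal sketch of the general idea: the small student model is trained to match the large teacher's output logits in addition to the true labels. The function names, the `alpha` weighting, and the choice of mean squared error on logits are assumptions for illustration (one common formulation), not necessarily the exact loss used in the article.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5):
    # Hard-label term: cross-entropy between the student's prediction and the true label.
    probs = softmax(student_logits)
    ce = -math.log(probs[label])
    # Soft term: mean squared error between student and teacher logits
    # (assumed here; a temperature-scaled KL divergence is another common choice).
    mse = sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / len(student_logits)
    # alpha is a hypothetical weighting between the two terms.
    return alpha * ce + (1 - alpha) * mse

# Example: teacher logits for one sentence, true label 0.
loss = distillation_loss([0.5, -0.2], [2.0, -1.0], label=0, alpha=0.5)
```

In a real training loop the teacher logits would be precomputed with BERT over the training set, and the same loss would be written with tensors so gradients flow into the LSTM.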

This method can potentially be used to run models directly on low-powered devices like smartphones for lower latency, or to greatly reduce the load on APIs that serve neural networks, at the cost of relatively little accuracy.

With the method in the article, the resulting model is so small that a single core of my laptop’s CPU could run inference on one sentence (of a length typical of SST-2, batch size 1) more than 500 times per second.
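
A throughput figure like this can be measured with a simple timing loop. The sketch below is an assumption about how one might do it, not the benchmark actually used: `measure_throughput` and `dummy_predict` are hypothetical names, and `dummy_predict` stands in for the real model's single-sentence forward pass.

```python
import time

def measure_throughput(predict, sentence, n_runs=1000, warmup=50):
    # Warm-up calls so one-time setup costs do not skew the timing.
    for _ in range(warmup):
        predict(sentence)
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(sentence)  # batch size 1: one sentence per call
    elapsed = time.perf_counter() - start
    return n_runs / elapsed  # inferences per second

def dummy_predict(sentence):
    # Hypothetical stand-in for the student model's forward pass.
    return len(sentence.split()) % 2

rate = measure_throughput(dummy_predict, "a charming and often affecting journey", n_runs=200)
print(f"{rate:.0f} inferences per second")
```

To reproduce a single-core number with a PyTorch model, one would also limit the thread count, for example with `torch.set_num_threads(1)`, before timing.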

If you have any questions or insights, please feel free to share them :slight_smile:
Here’s the code:

P.S. Training the student model with fastai v2 bumps accuracy to 89% :smiley: I may update the article if I find a way to implement the entire workflow in fastai.