NLP papers about small models

It is well known that much of the new research in NLP boils down to “more data, more training, more parameters”.
As a student, I know that you can accomplish a lot (if not more) with a single GPU if you use common sense.
So my question: recently, articles like ELECTRA or RoBERTa have been working on reducing the size of the network. Do you know any other recommended articles in this style? Articles that are not from the big companies are welcome too.
In particular, I am looking for articles that focus on transformers, because I work with them at my job.

Are you looking to train them from scratch or fine-tune them for downstream tasks?
You can still fine-tune quite large BERT models on a single GPU.
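
As a rough sanity check on the single-GPU point, here is a back-of-the-envelope sketch of the memory needed for weights, gradients, and Adam optimizer states in fp32. The helper name `training_memory_gib` is made up for illustration, the 340M figure is the approximate size of BERT-large, and activations are ignored even though they depend on batch size and sequence length and can easily dominate, so treat the result as a lower bound only.

```python
def training_memory_gib(num_params, bytes_per_value=4):
    """Lower-bound estimate of training memory in GiB (hypothetical helper).

    Counts only weights, gradients, and the two Adam moment buffers.
    Activations, framework overhead, and mixed precision are ignored.
    """
    weights = num_params * bytes_per_value
    grads = num_params * bytes_per_value
    adam_states = 2 * num_params * bytes_per_value  # first and second moments
    return (weights + grads + adam_states) / 1024**3


# Assumption: roughly 340M parameters, the approximate size of BERT-large.
bert_large_gib = training_memory_gib(340_000_000)
print(f"{bert_large_gib:.1f} GiB before activations")
```

Whatever is left on the card after this goes to activations, which is why small batch sizes and shorter sequences are the usual levers on a single GPU.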

You can take a look at distillation as well.
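
In case it helps, here is a minimal pure-Python sketch of the usual soft-target distillation loss: the student is trained to match the teacher's temperature-softened output distribution. The function names, the example logits, and the temperature of 2.0 are assumptions for illustration. In practice you would compute this with your deep learning framework and usually mix it with the normal task loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Made-up logits for a 3-class example.
teacher = [4.0, 1.0, 0.5]
student_close = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
student_far = [0.5, 1.0, 4.0]    # disagrees with the teacher

print(distillation_loss(student_close, teacher))
print(distillation_loss(student_far, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so the second value printed is the larger one.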

Also, what I was doing is shrinking RoBERTa several times over so it can be trained from scratch on 1-2 GPUs. It won't give you comparable performance, but it can be practical for playing around and for debugging your scripts.
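
To give a feel for how much shrinking buys you, here is a small sketch that estimates the parameter count of a BERT/RoBERTa-style encoder from its main config values. The function `encoder_param_count` and the "small" configuration are hypothetical, not the exact setup I used. The "base" numbers are the standard RoBERTa-base sizes, and the count leaves out the pooler and the LM head.

```python
def encoder_param_count(vocab_size, hidden, layers, intermediate,
                        max_positions=514, type_vocab=1):
    """Approximate parameter count of a BERT/RoBERTa-style encoder.

    Hypothetical helper: counts embeddings plus transformer layers,
    and excludes the pooler and the language-modeling head.
    """
    # Token, position, and token-type embeddings, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Self-attention: Q, K, V, and output projections (weights + biases).
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden  # + LayerNorm
    # Feed-forward: hidden -> intermediate -> hidden (weights + biases).
    ffn = (hidden * intermediate + intermediate
           + intermediate * hidden + hidden + 2 * hidden)    # + LayerNorm

    return embeddings + layers * (attention + ffn)


# Standard RoBERTa-base sizes.
base = encoder_param_count(vocab_size=50265, hidden=768, layers=12,
                           intermediate=3072)
# Assumed shrunken config, for illustration only.
small = encoder_param_count(vocab_size=50265, hidden=256, layers=4,
                            intermediate=1024)

print(f"base:  {base / 1e6:.1f}M parameters")
print(f"small: {small / 1e6:.1f}M parameters")
```

Note that with a 50k vocabulary the embedding matrix ends up dominating the small model, so past a certain point a smaller vocabulary saves more than removing further layers.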