Has anyone seen this work before?
Single Headed Attention RNN: Stop Thinking With Your Head
The author Stephen Merity is also one of the authors of the paper Quasi-Recurrent Neural Networks
The paper reports that the model achieves results comparable to Transformers on language modelling under a limited compute budget (a single GPU and less than 24 hours). The statistics look promising, particularly for consumer users: training a language model is no longer reserved for those with high-spec machines.
I loved this paper and had always meant to spend more time on it, but I never actually got around to playing with the code. It's a pity it didn't gain more traction.
Yes, the paper is super interesting. I heard about it when Stephen Merity was on the TWIML AI podcast. I liked his point that RNNs might be understudied at the moment and that, while the progress made with transformer architectures is certainly impressive, they shouldn't be completely abandoned.
Btw, Stephen Merity is also a co-author of the AWD-LSTM paper - Regularizing and Optimizing LSTM Language Models - which is the architecture behind fastai's ULMFiT.
A key reason transformers work so well is that they allow for parallelization across the sequence during training, while a traditional RNN doesn't (plus their use of attention, of course, but that's also something RNNs can incorporate).
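To make that contrast concrete, here's a minimal NumPy sketch (toy dimensions, not the SHA-RNN architecture itself): the RNN's hidden state at step t depends on the state at t-1, which forces a serial loop over time, whereas self-attention computes every output position from the whole sequence with a few matrix multiplies.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4  # toy sequence length and hidden size
x = rng.standard_normal((T, d))

# Vanilla RNN: h_t depends on h_{t-1}, so the time dimension
# must be processed sequentially even at training time.
W, U = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h = np.zeros(d)
hs = []
for t in range(T):  # inherently serial loop over timesteps
    h = np.tanh(x[t] @ W + h @ U)
    hs.append(h)
hs = np.stack(hs)  # (T, d)

# Self-attention: all positions computed at once, no loop over time,
# so the work parallelizes trivially across the sequence.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)                 # (T, T) pairwise scores
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax over keys
attn_out = weights @ v                         # (T, d), one shot
```

Both produce a (T, d) output, but only the attention path is expressed as batched matrix algebra that a GPU can parallelize over the whole sequence.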
A brave researcher might try to come up with a new type of RNN that does allow for parallel computation.