Very interesting paper where they manage to reduce the complexity of a transformer's self-attention from O(n²) to O(n).
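For the curious, here's a toy sketch of one common way attention gets linearized (I'm assuming a kernel-feature-map trick here, which may not be exactly what this particular paper does): replace the softmax with a feature map φ, then reassociate (φ(Q)φ(K)ᵀ)V as φ(Q)(φ(K)ᵀV) so the n×n attention matrix is never materialized.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention in O(n * d^2) instead of O(n^2 * d).

    phi is a placeholder positive feature map (an assumption for this
    sketch, not taken from the paper).
    """
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                 # (d, d_v): summarize keys/values once
    Z = Qp @ Kp.sum(axis=0)       # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None] # (n, d_v): no n x n matrix was built

# tiny smoke test
n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

The key point is just associativity of matrix products: computing φ(K)ᵀV first costs O(n·d²), so the whole thing scales linearly in sequence length.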
Combined with other very interesting papers that came out recently, this seems like it could make transformers much more accessible. For example:
Funnel Transformer: https://arxiv.org/abs/2006.03236
Basically the same concept as U-Net in vision, but with a transformer architecture: pooling layers progressively shorten the hidden representations in the encoder, and the decoder upscales them back, fusing in same-length hidden representations from the encoder.
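A very rough sketch of that encode-pool / decode-upsample idea (assuming mean pooling, nearest-neighbor upsampling, and additive skip connections for illustration; the paper's actual operations are more involved):

```python
import numpy as np

def pool(h, stride=2):
    # mean-pool along the sequence axis, shortening the representation
    n = h.shape[0] - h.shape[0] % stride
    return h[:n].reshape(n // stride, stride, -1).mean(axis=1)

def upsample(h, stride=2):
    # nearest-neighbor upsampling along the sequence axis
    return np.repeat(h, stride, axis=0)

n, d = 8, 4
h0 = np.random.default_rng(0).normal(size=(n, d))

# encoder: progressively shorter hidden representations
h1 = pool(h0)           # (4, d)
h2 = pool(h1)           # (2, d)

# decoder: upscale and fuse with the same-length encoder states
u1 = upsample(h2) + h1  # (4, d)
u0 = upsample(u1) + h0  # (8, d)
print(u0.shape)  # (8, 4)
```

Since self-attention cost grows with sequence length, most layers operating on the pooled (shorter) sequences is exactly where the savings come from.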
Those two papers feel like they could be combined to make transformers far less computationally expensive.