Hi all. I have been reading about Transformers in language modeling with the intent, of course, of understanding them. However, after reading many articles, they remain a blur.
My confusion is NOT about the code itself. With enough effort I can follow the formulas and the Python code that implements Transformers. Rather, I don't understand the "why" behind the design: what purpose each component serves, and how it serves it.
Can anyone point me to articles that explain the intuition behind the Transformer design? Most of the articles I find merely regurgitate the figures and text from the original paper that describes the architecture.
Here are a few specific questions:
(Note: I understand vanilla attention and how it works with RNNs; their why and how is clear. I understand that Transformers are great and that they work very well in practice. I understand the formulas and the implementation. What I do not understand is why and how the Transformer design makes sense.)
What problems beyond standard Attention are Transformers trying to solve?
The design borrows the language of databases with Query, Key, and Value. Why choose these terms? In what way is the Transformer's use of them analogous to a database lookup?
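To anchor what I mean: here is my own toy NumPy sketch of scaled dot-product attention (single head, no masking, made-up names). As I read it, attention acts like a "soft" database lookup where every key matches a little, and with near-one-hot queries and keys it degenerates into an exact lookup:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Each query is scored against every key (the "lookup"),
    # and the scores weight a mixture of the values (the "retrieval").
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted average of values

# Sharply peaked one-hot keys/queries make it an (almost) exact dictionary lookup:
K = np.eye(3) * 10.0                 # three "addresses"
V = np.array([[1.0], [2.0], [3.0]])  # the stored "records"
Q = np.eye(3) * 10.0                 # query each address exactly
print(attention(Q, K, V))            # ≈ [[1.], [2.], [3.]]
```

Is this the right mental model, and if so, why does a *learned, soft* lookup help with language?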
Multi-headed attention. What problem in the architecture is this addition trying to solve, and how does it solve it?
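Again, to show what I *do* follow mechanically: here is my toy NumPy sketch of the multi-head split (random placeholder weights, names mine, shapes only). Each head gets its own small Q/K/V projection, the heads attend independently, and the results are concatenated and projected back. The shapes work out; the "why" is what eludes me:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads          # each head works in a smaller subspace

x = rng.standard_normal((seq_len, d_model))
# One learned projection per head for Q, K, V (random placeholders here).
Wq = rng.standard_normal((n_heads, d_model, d_head))
Wk = rng.standard_normal((n_heads, d_model, d_head))
Wv = rng.standard_normal((n_heads, d_model, d_head))
Wo = rng.standard_normal((d_model, d_model))  # output projection

heads = []
for h in range(n_heads):
    Q, K, V = x @ Wq[h], x @ Wk[h], x @ Wv[h]
    scores = Q @ K.T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    heads.append(weights @ V)        # (seq_len, d_head) per head

# Concatenate the heads and project back to the model dimension.
out = np.concatenate(heads, axis=-1) @ Wo
print(out.shape)                     # (5, 16)
```

So mechanically it is N small attentions in parallel instead of one big one. But what goes wrong with a single head that this split fixes?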
Thanks for any leads that might clarify these questions.