Transformer - the why and how of its design

(Malcolm McLean) #1

Hi all. I have been reading about Transformers in language modeling, with the intent of course to understand them. However, after reading many articles, they remain a blur.

My confusion is NOT about the code itself. With enough effort I can follow the formulas and Python code that implement Transformers. Rather, I don’t understand the “why” behind the design: what purpose each component serves and how it does so.

Can anyone point me to articles that explain the intuitions behind the Transformer design? The many articles I find merely regurgitate the images and text from the original paper that describes the architecture.

Here are a few specific questions:
(Note: I understand vanilla attention and how it works with RNNs; their why and how is clear. I understand that Transformers are great and that they work very well in practice. I understand the formulas and the implementation. But I do not understand why and how the Transformer design makes sense.)

  • What problems beyond standard Attention are Transformers trying to solve?

  • The design borrows from the language of databases with Query, Key, and Value. Why choose these terms? In what way is the Transformer’s use of them analogous to a database?

  • Multi-headed Attention. What problem with the architecture is this addition trying to solve? How does it solve that problem?

Thanks for any leads that might clarify these questions.


(Darek Kleczek) #2

I am learning this stuff myself, so I’ll share my understanding, but don’t consider it expert advice :wink:

What problems beyond standard Attention are Transformers trying to solve?
Transformers were developed in the context of recurrent models (e.g. LSTM/GRU), which used to perform best for language modeling. Transformers have one big advantage: the encodings for all positions can be calculated in parallel, which means they train faster and are easier to apply to very large corpora. Another advantage, probably less obvious, is that they may capture long-range dependencies between words better: the “distance”, or amount of calculation, between any two words in a sequence is constant, whereas in a recurrent model it grows with how far apart the words are.
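Here’s a toy NumPy sketch of that difference (my own, not from any paper; all names and dimensions are made up): the recurrent update has to run as a sequential loop, while the attention scores come out of a single matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
x = rng.normal(size=(seq_len, d))     # one embedding vector per token

# Recurrent-style encoding: each state depends on the previous one,
# so the loop over t cannot be parallelized, and information from
# token 0 reaches token 5 only after 5 updates.
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Attention-style scoring: all pairwise interactions in one shot, and
# any two tokens are one step apart regardless of distance in the text.
scores = x @ x.T / np.sqrt(d)         # (seq_len, seq_len)
```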

The design borrows from the language of databases with Query, Key, and Value. Why choose these terms? In what way is Transformer’s use of them analogous to a database?
I’m not sure the database comparison is that helpful… These matrices are parts of the mechanism that calculates self-attention. Loosely, each token’s query says what it is looking for, each token’s key says what it offers to be matched against, and the attention weights then decide how much of each token’s value to retrieve: a soft, differentiable lookup rather than an exact database match. Here’s a very good explanation of how this works: http://jalammar.github.io/illustrated-transformer/
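To make the Q/K/V roles concrete, here’s a minimal NumPy sketch of a single attention head (my own toy code; the projection matrices would be learned in a real model):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 8
x = rng.normal(size=(seq_len, d_model))

# Learned projections (random here) turn each token into a query,
# a key, and a value.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Each query is scored against every key (the "soft lookup"), and the
# resulting weights blend the values: a differentiable retrieval.
weights = softmax(Q @ K.T / np.sqrt(d_k))   # (seq_len, seq_len)
output = weights @ V                        # (seq_len, d_k)
```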

Multi-headed Attention. What problem with the architecture is this addition trying to solve? How does it solve that problem?
The way I understand it, this is similar to different layers in a DNN: they capture different features, or in this case different relationships between words. Any two words in one sentence may be connected in many different ways, for example subject-object, context, the time dimension, … Multi-headed attention allows the model to capture these different relationships, with each head free to attend to a different one.
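A rough sketch of how the heads fit together (again my own toy code, not a faithful implementation of the paper): each head gets its own projections, runs attention independently, and the results are concatenated and mixed back to the model dimension.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, W_q, W_k, W_v):
    # One head of scaled dot-product attention.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))

# Each head has its own projections, so it can learn to attend to a
# different kind of relationship (e.g. subject-object vs. context).
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(x, W_q, W_k, W_v))

# Concatenate the heads and mix them back to d_model.
W_o = rng.normal(size=(n_heads * d_head, d_model))
output = np.concatenate(heads, axis=-1) @ W_o   # (seq_len, d_model)
```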

Again, this article was very helpful to me: http://jalammar.github.io/illustrated-transformer/
