Transformer - the why and how of its design

Hi all. I have been reading about Transformers in language modeling, with the intent, of course, of understanding them. However, after reading many articles, they remain a blur.

My confusion is NOT about the code itself. With enough effort I can follow the formulas and Python code that implement Transformers. Rather, I don’t understand the “why” behind the design: what purpose each component serves and how it does so.

Can anyone point me to articles that explain the intuitions behind the Transformer design? The many articles I find merely regurgitate the images and text from the original paper that describes the architecture.

Here are a few specific questions:
(Note: I understand vanilla attention and how it works with RNNs (their why and how is clear). I understand that Transformers are great and that they work very well in practice. I understand the formulas and the implementation. But I do not understand why and how the Transformer design makes sense.)

  • What problems beyond standard Attention are Transformers trying to solve?

  • The design borrows from the language of databases with Query, Key, and Value. Why choose these terms? In what way is the Transformer’s use of them analogous to a database?

  • Multi-headed Attention. What problem with the architecture is this addition trying to solve? How does it solve that problem?

Thanks for any leads that might clarify these questions.


I am learning this stuff myself and will share my understanding, but don’t consider this expert advice :wink:

What problems beyond standard Attention are Transformers trying to solve?
Transformers were developed in the context of recurrent models (e.g. LSTM/GRU), which used to perform best for language modeling. They have one big advantage, which is parallel computation of the encodings: this means they train faster, and it’s easier to apply them to very large corpora. Another advantage - probably less obvious - is that they may capture long-range dependencies between words better, since the “distance”, or number of computation steps, between any two words in a sequence is constant.
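To make the parallelism point concrete, here is a minimal NumPy sketch of scaled dot-product self-attention (my own toy code, not taken from any particular implementation, with arbitrary dimensions): every position’s output is computed from every other position in a few matrix multiplications, with no loop over time steps the way an RNN needs.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Toy scaled dot-product self-attention over a whole sequence at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project inputs into queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq_len, seq_len): every position vs. every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # each output is a weighted mix of all values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # a toy "sentence": 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 8), computed for all positions in parallel
```

This is also where the constant “distance” shows up: token 1 and token 5 interact directly through a single entry of the scores matrix, instead of through four recurrent steps.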

The design borrows from the language of databases with Query, Key, and Value. Why choose these terms? In what way is the Transformer’s use of them analogous to a database?
I’m not sure the database comparison is all that helpful… These are the parts of the mechanism that computes self-attention. Here’s a very good explanation of how it works: http://jalammar.github.io/illustrated-transformer/
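For what it’s worth, here is how I picture the database analogy, as a toy sketch (my own illustration, not from the article): a Python dict is a “hard” lookup, where a query matches exactly one key, while attention is a “soft” lookup, where the query is scored against every key and the result is a similarity-weighted mix of all the values.

```python
import numpy as np

# Hard lookup: the query matches exactly one key and returns exactly one value.
table = {"cat": 1.0, "dog": 2.0}
print(table["cat"])                                 # -> 1.0

# Soft lookup (the attention view): score the query against every key,
# then return a weighted average of every value.
keys = np.array([[1.0, 0.0], [0.0, 1.0]])           # one key vector per stored item
values = np.array([1.0, 2.0])                       # the value stored under each key
query = np.array([0.9, 0.1])                        # a query that mostly "asks for" the first item

scores = keys @ query                               # dot-product similarity to each key
weights = np.exp(scores) / np.exp(scores).sum()     # softmax -> weights that sum to 1
print(weights @ values)                             # mostly the first value, with some of the second mixed in
```

The twist in the Transformer is that queries, keys, and values are not fixed entries in a table: they are all learned linear projections of the same token embeddings, so the network itself learns what each token should “ask for” and what it should “offer”.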

Multi-headed Attention. What problem with the architecture is this addition trying to solve? How does it solve that problem?
The way I understand this is similar to different layers in a DNN - they capture different features, or relationships between words. Any two words in one sentence may be connected in many different ways, for example subject-object, context, time-dimension, … Multi-headed attention allows the model to capture these different relationships.

Again, this article was very helpful to me: http://jalammar.github.io/illustrated-transformer/


Thanks for sharing your understanding. The article you referenced is very informative, especially on the “what”. Not so much on the “how” and “why”, however.

What problems beyond standard Attention are Transformers trying to solve?…

Got it, thanks!

The design borrows from the language of databases with Query, Key, and Value. Why choose these terms? In what way is the Transformer’s use of them analogous to a database?

The article says,

What are the “query”, “key”, and “value” vectors?

They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

Uh, ok, but it did not play out that way for me, even after grasping the calculations. Exactly why and how did the original authors come up with this particular Query, Key, Value design?

Multi-headed Attention. What problem with the architecture is this addition trying to solve? How does it solve that problem?
The way I understand this is similar to different layers in a DNN - they capture different features, or relationships between words. Any two words in one sentence may be connected in many different ways, for example subject-object, context, time-dimension, … Multi-headed attention allows the model to capture these different relationships.

You mean like the various features detected by different activations in a visual DNN, for example, eyes and fur textures? It’s definitely a plausible analogy.

Still, I would like to understand more specifically how starting from randomly initialized Q/K/V matrices ends up discovering different language features, and why these can be successfully combined by concatenation plus multiplication by another learnable matrix.
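For concreteness, here is the step I mean as a toy NumPy sketch of my own (arbitrary dimensions; the mechanics are clear to me, it’s the “why it works” that isn’t): each head gets its own independently initialized Q/K/V projections, so each head can settle on a different attention pattern, and the final learnable matrix Wo mixes the concatenated head outputs back down to the model dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, Wo):
    """heads: one (Wq, Wk, Wv) triple per head, each independently initialized."""
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # each head computes its own attention pattern
        outputs.append(A @ V)
    concat = np.concatenate(outputs, axis=-1)          # put the heads' results side by side...
    return concat @ Wo                                 # ...and let a learned matrix mix them together

rng = np.random.default_rng(0)
d_model, n_heads, d_head = 8, 2, 4
X = rng.normal(size=(5, d_model))                      # 5 tokens, 8-dim embeddings
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, heads, Wo).shape)        # back to (5, 8)
```

The best answer I have found so far is the usual symmetry-breaking story: the heads start out different, receive different gradients, and so drift toward different patterns, while concat + Wo just gives the next layer a learned way to weigh whatever each head found. But that still feels more like “how” than “why”.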

I think I’ll have to give up on this one for now, and be content with understanding the “what” without clarity on the “why” and “how”.

Thanks again for your reply; I appreciate your responding.

Cheers and good night,
Malcolm


I would interpret multi-head attention slightly differently. Think of multi-head attention as different filters in a conv layer, or different neurons in a given hidden layer. The different heads in the self-attention layer operate at the same level of the hierarchy, so to speak.

The Transformer has a stack of encoders/decoders. Each encoder/decoder in the stack can be thought of as a different layer, building up a hierarchy of features.

So, taking the CNN analogy, multi-head attention gives you multiple filters at a given level; the hierarchy/multi-layer aspect comes from the stacking.
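Here is a bare-bones structural toy of that analogy (my own sketch, leaving out everything a real Transformer layer adds - feed-forward sub-layers, residual connections, layer norm): the inner loop is the “filters at a given level”, the outer loop is the stacking.

```python
import numpy as np

def toy_head(X, Wq, Wk, Wv):
    """One attention head: score every token against every other, then mix the values."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wk.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                  # softmax over the keys
    return A @ (X @ Wv)

rng = np.random.default_rng(0)
d, n_heads, n_layers = 8, 2, 3
X = rng.normal(size=(5, d))                             # 5 tokens, 8-dim embeddings

for _ in range(n_layers):                               # the hierarchy: layers stacked on top of each other
    head_outs = [
        toy_head(X, *(rng.normal(size=(d, d // n_heads)) for _ in range(3)))
        for _ in range(n_heads)                         # the "filters": heads side by side within one layer
    ]
    X = np.concatenate(head_outs, axis=-1) @ rng.normal(size=(d, d))  # concat + project, feed the next layer

print(X.shape)                                          # still (5, 8) after every layer
```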


Makes sense, thanks for the clarification!

I’ve been trying to understand Transformers too, and I’m also very confused. I’ll take a stab at one of your questions, if nothing else as a sanity check for myself.

What problems beyond standard Attention are Transformers trying to solve?

I think it’s less about what the Transformer is adding and more about what it’s taking away: namely, the recurrent component. My (very broad) understanding of the development of the field is that RNNs were adopted because of structural advantages, like being able to accept inputs of arbitrary length, and because the idea of a single recurrent unit appears pleasingly simple. But in practice there were a lot of issues getting them to work, especially with learning long-range dependencies and with vanishing/exploding gradients. The fiddly bits of more advanced RNNs like LSTMs and GRUs are attempts to solve these problems. Finally, with the Transformer things have sort of come full circle: in effect it says that RNNs are more trouble than they’re worth after all, and that with a really clever application of attention we can skip the recurrence altogether. So my (again, very broad) understanding of the top-down motivation for the Transformer architecture is that it’s trying to capture sequential patterns in the data without actually using any recurrent structure.