Where is the stack of encoders in Transformer?

Looking at the source code here.

I don’t see the stack of encoders described in the paper, so I’m looking for clarification on where this implementation follows the paper and where it doesn’t, and on the thinking behind it.

Maybe this will be a topic in that bonus session.

Hi everyone - does anyone know the answer to @wgpubs’s question? I am also wondering what approach I should use to supply an encoder to the transformer, as its implementation doesn’t appear to contain the Nx encoder part of the architecture described in the original paper.

Now, I may be mistaken, but I think the transformer layers (the stack described in the paper) are implemented at line 159: self.layers = nn.ModuleList([DecoderLayer(n_heads, d_model, d_head, d_inner, resid_p=resid_p, attn_p=attn_p, ff_p=ff_p, bias=bias, scale=scale, act=act, double_drop=double_drop, attn_cls=attn_cls) for k in range(n_layers)]).

Since the encoder layers are directly linked to the decoder layers, I’m guessing this simplification was made on purpose.

Thanks @imago! I suspect the encoder is not implemented, because the parameters passed to DecoderLayer are never passed to an Encoder anywhere. To me, the implementation looks like a simple embedding plus positional encoding, followed directly by the decoder. I have seen papers call this the “decoder-only transformer architecture”, so it may have been left out of fastai’s Transformer on purpose. What I am trying to understand is whether there are any recommendations, specific to the Transformer architecture, on when to include an Encoder and when to leave it out.

The language model associated with the Transformer does not have encoder layers, since those are unmasked and a language model must not see the tokens it is trying to predict. If you want the seq2seq transformer, there is an implementation in this notebook.
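
To make the masked vs. unmasked distinction concrete, here is a minimal sketch in plain Python. It is not taken from the fastai source; the helper names attention_mask and visible_positions are made up for illustration. It only shows which positions each token is allowed to attend to in the two cases.

```python
# Illustrative only: these helpers are hypothetical and not part of fastai.

def attention_mask(seq_len, causal):
    """Return mask[i][j] == True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


def visible_positions(mask, i):
    """List the positions that token i is allowed to look at."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


# Unmasked (encoder-style) attention: every token sees the whole sequence.
encoder_mask = attention_mask(4, causal=False)

# Masked (language-model-style) attention: token i sees only tokens 0..i.
lm_mask = attention_mask(4, causal=True)

print(visible_positions(encoder_mask, 1))  # [0, 1, 2, 3]
print(visible_positions(lm_mask, 1))       # [0, 1]
```

With the unmasked version, the token at position 1 can already see positions 2 and 3, which would leak the answer when the training objective is to predict the next token. That is why a language model keeps only the masked layers.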

Adding my 2 cents …

The fast.ai transformer functions more like BERT, which is essentially a stack of just Transformer encoders. This is because the objective here is language modeling rather than something like NMT (or any other seq2seq task).