Hi! I’m fiddling around with the Transformer implementation to adapt it for a time-series-ish project, and I found something that might need to be changed. I wasn’t sure if it’s worth an issue, so I’m posting here.
Unlike AWD_LSTM, the Transformer implementation is not aware of the pad_token configuration. Since the Transformer applies processing like positional encodings even to padded tokens, I think the most correct fix is to create a mask for attention (as opposed to simply setting padding_idx on the Embedding layer). I’m not sure whether this makes a difference in practice, since the model might learn to do nothing for padding tokens anyway.
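To make concrete what I mean by an attention mask: something like the sketch below, which builds a key-padding mask from the token ids and fills the corresponding attention scores with -inf before the softmax, so padded positions get zero attention weight. (This is my own illustration, not fastai's code; `pad_idx=1` is just fastai's default pad token, and the function names are made up.)

```python
import torch

def padding_attn_mask(tokens, pad_idx=1):
    # True where a position is padding; shaped (batch, 1, 1, seq_len)
    # so it broadcasts over attention heads and query positions.
    return (tokens == pad_idx)[:, None, None, :]

def masked_attention_scores(scores, tokens, pad_idx=1):
    # scores: raw dot-product attention scores, (batch, heads, seq, seq).
    # Fill padded *key* positions with -inf so softmax assigns them
    # exactly zero weight instead of relying on the model to learn it.
    mask = padding_attn_mask(tokens, pad_idx)
    return scores.masked_fill(mask, float('-inf'))

# e.g. a batch of one sequence where the first two tokens are padding:
tokens = torch.tensor([[1, 1, 5, 6]])
scores = torch.zeros(1, 2, 4, 4)  # dummy uniform scores, 2 heads
weights = torch.softmax(masked_attention_scores(scores, tokens), dim=-1)
# weights over the two padded keys are 0; the rest share the mass
```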
Also, while fastai pads at the beginning of the sequence, it seems Transformers should be padded at the end. The GPT docs say:
“GPT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.”
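For clarity, here is a minimal sketch of the difference, assuming fastai's default pad token id of 1 (the `pad_first` flag is just my illustration of the two behaviors, not an actual fastai parameter):

```python
import torch

def pad_batch(seqs, pad_idx=1, pad_first=False):
    # Pad variable-length token lists to a rectangular tensor.
    # pad_first=True mimics fastai's default (pad at the start);
    # pad_first=False pads at the end, as the GPT docs recommend
    # for models with absolute position embeddings.
    max_len = max(len(s) for s in seqs)
    out = torch.full((len(seqs), max_len), pad_idx, dtype=torch.long)
    for i, s in enumerate(seqs):
        if pad_first:
            out[i, max_len - len(s):] = torch.tensor(s)
        else:
            out[i, :len(s)] = torch.tensor(s)
    return out

# pad_batch([[5, 6, 7], [8]])                 -> [[5, 6, 7], [8, 1, 1]]
# pad_batch([[5, 6, 7], [8]], pad_first=True) -> [[5, 6, 7], [1, 1, 8]]
```

With left-padding, the real tokens of a short sequence start at a different position index in every batch, which is exactly what absolute position embeddings dislike.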
I don’t have a good enough GPU to experiment with the full NLP model, so I’m wondering whether these changes make sense and could make a difference.