Hi all, I have been meaning to understand how the GPT-2 model is implemented from scratch in PyTorch.
So far, I understand the paper and the theory behind how the model works quite well: the Transformer architecture, the decoder-only stack of 12 layers, etc. (roughly the structure sketched below).
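For concreteness, here is a minimal sketch of what I believe one such decoder block looks like in PyTorch. This is my own rough approximation using the small-model config from the paper (768 dimensions, 12 heads, 12 layers, pre-LayerNorm), not the actual OpenAI code, which is exactly the part I'd like help pinning down:

```python
import torch
import torch.nn as nn

class GPT2Block(nn.Module):
    """One decoder block (sketch): pre-LayerNorm, masked self-attention, MLP."""
    def __init__(self, d_model=768, n_heads=12):  # GPT-2 small config, per the paper
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(            # position-wise feed-forward, 4x expansion
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: True entries are blocked, so each position
        # can only attend to itself and earlier positions
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                 # residual connection around attention
        x = x + self.mlp(self.ln_2(x))   # residual connection around the MLP
        return x

# GPT-2 small stacks 12 of these blocks
blocks = nn.ModuleList(GPT2Block() for _ in range(12))
```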
I was wondering if somebody else is interested in this, so that together we could write an article similar to The Annotated Transformer, making it easy for anyone in the future who wants to understand the source code and implementation of the model.
Is anybody else keen? I am happy to provide insights into the model's theory and inner workings, which I understand well, but I really need help grasping the source code.