I have written a detailed blog post about implementing self-attention from scratch using PyTorch. I have tried to keep things simple and jargon-free, with step-by-step explanations.
… loved reading your blog – especially the librarian analogy, that was an eye-opener. Thanks a lot!
This is a clear, beginner-friendly tutorial aimed at readers who already know basic PyTorch (tensors, nn.Linear, softmax) and want to understand the internals of self-attention – the heart of modern LLMs.
Tip: I like seeing concepts paired with code, and would invite you to do the same.
You discussed: tokenization / embedding / cosine-similarity / attention.
The embedding and attention parts were very good. The tokenization and cosine-similarity parts could benefit from a bit more explanation.
Since you like PyTorch as much as I do, maybe add a small note about F.scaled_dot_product_attention
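To make that suggestion concrete, here is a minimal sketch of what such a note might show: the from-scratch computation next to the built-in `F.scaled_dot_product_attention` (available in PyTorch 2.0 and later). The tensor sizes and variable names are my own assumptions for illustration, not taken from the blog post.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch, context, attn_dim = 2, 5, 8
q = torch.randn(batch, context, attn_dim)
k = torch.randn(batch, context, attn_dim)
v = torch.randn(batch, context, attn_dim)

# From-scratch version: softmax(Q K^T / sqrt(d)) V
scores = q @ k.transpose(-2, -1) / math.sqrt(attn_dim)  # [batch, context, context]
weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
manual_out = weights @ v                                # [batch, context, attn_dim]

# Built-in version with default arguments: no mask, no dropout.
fused_out = F.scaled_dot_product_attention(q, k, v)

# The two results should agree up to floating-point tolerance.
print(torch.allclose(manual_out, fused_out, atol=1e-5))
```

The built-in call can pick a fused kernel where one is available, so it is usually what you would use in practice, while the manual version stays useful for teaching.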
Also, a quick note on the shape transition when introducing batches: [context, attn_dim] → [batch, context, attn_dim]
(and when you introduce multi-head: [batch, context, heads, attn_dim])
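A rough sketch of those shape transitions, in case it helps. The sizes, the layer names `proj` and `proj_mh`, and the reading of `attn_dim` as the per-head dimension are assumptions on my part, not something from the blog post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
context, embed_dim, attn_dim, heads, batch = 4, 16, 8, 2, 3

proj = nn.Linear(embed_dim, attn_dim, bias=False)

# Single sequence: [context, embed_dim] -> [context, attn_dim]
x = torch.randn(context, embed_dim)
q_single = proj(x)
print(q_single.shape)

# Batched: nn.Linear only acts on the last dimension, so the same layer
# gives [batch, context, attn_dim] without any code changes.
xb = torch.randn(batch, context, embed_dim)
q_batched = proj(xb)
print(q_batched.shape)

# Multi-head: project to heads * attn_dim, then split the last dimension
# to get [batch, context, heads, attn_dim].
proj_mh = nn.Linear(embed_dim, heads * attn_dim, bias=False)
q_heads = proj_mh(xb).view(batch, context, heads, attn_dim)
print(q_heads.shape)

# Many implementations then swap the context and heads axes so that
# matmul treats each head independently: [batch, heads, context, attn_dim]
q_heads_t = q_heads.transpose(1, 2)
print(q_heads_t.shape)
```

Printing the shape after every step like this tends to be the quickest way for readers to see that batching is free with `nn.Linear`, and that multi-head is mostly a reshape.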
Maybe your blog could end with a list of next steps:
Positional encoding
Masking
Multi-head attention
Residual blocks / LayerNorm
Encoder-Decoder
Training / loss
So, well done – nice example – thx and keep blogging …
Many thanks for your pointed feedback on what you liked and what could be improved. I really like your approach. I have taken your feedback into consideration and I’ll update this post in the next few days. In the meantime, I have also started writing about Multi-Head Attention.