Let’s Build the GPT Tokenizer (text version)

We just published Let’s Build the GPT Tokenizer: A Complete Guide to Tokenization in LLMs, a translation of Andrej Karpathy’s excellent video into a book chapter. It’s likely to be of interest to anyone wanting to learn about how modern large language models work.

Next week we’ll post an article showing how we made it using Solveit.

Reply here if you have any questions or thoughts about the article or tokenization in general.


That’s great. I am building DeepSeek from scratch using fastai these days.


Hi Jeremy, the post looks great. I look forward to going through it in detail over the next few days.

Any time I try to click on the Solveit link in any browser I get a message saying “This site can’t be reached” or “The connection was reset”. I also get a message from Norton warning of a dangerous website or connection. This is happening in Firefox, Edge and Chrome. Just wondering if anyone else is having this issue?

Sounds like your anti-virus has a false positive. Does it have a way to add exceptions? If so, add solve.it.com

It’s a nice post. I have some questions:

  1. how to evaluate tokenizers, whether pretrained or trained on a specific language?
  2. tokenizers on GPU instead of CPU
  3. tokenizer with task-heel