Lesson 4 official topic

It took me an hour to figure this out, but the underscore used in the Kaggle notebook is not actually an underscore. It’s a character called “lower one eighth block”, Unicode U+2581. The author has used it in the notebook, and it is also the character SentencePiece uses when tokenizing your text: it is prefixed to the vocabulary keys of tokens that had a space before them.

From this page: GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol “▁” (U+2581) as follows.

What this results in is students trying the code on their own laptops, manually entering the line

and getting very confused when they get a key error on their local machine.

Here are the two underscores next to each other for comparison: ▁_
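
If the two characters look identical in your font, a quick way to tell them apart is to ask Python what they are. This is just a small sketch using the standard library; the helper name `describe` is my own and not something from the notebook.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("\u2581"))  # the SentencePiece meta symbol
print(describe("_"))       # the ordinary ASCII underscore
```

The first line prints `U+2581 LOWER ONE EIGHTH BLOCK` and the second prints `U+005F LOW LINE`, so they really are two different characters.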

So if you check whether a key exists, you can’t just type an underscore in your source code. You have to use the same Unicode character (by copy-pasting it or writing the escape \u2581 in a string) or some other technique to get at it.
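
Here is a rough sketch of what that looks like in practice. The `vocab` dictionary below is a made-up stand-in for a tokenizer vocabulary (the tokens and ids are invented for illustration), and `SPM_SPACE` and `word_start_key` are names I picked, not part of any library.

```python
# The SentencePiece whitespace meta symbol, written as an escape so it
# cannot be confused with an ASCII underscore when typed by hand.
SPM_SPACE = "\u2581"

# Hypothetical vocabulary: token -> id. The ids are invented for this example.
vocab = {SPM_SPACE + "of": 1, "of": 2}


def word_start_key(word: str) -> str:
    """Build the key for a token that was preceded by a space in the text."""
    return SPM_SPACE + word


print(word_start_key("of") in vocab)  # key built with U+2581
print("_of" in vocab)                 # key typed with an ASCII underscore
```

The first lookup succeeds and the second does not, which is exactly the key error you hit when you retype the line with a normal underscore.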

I think adding a comment on that particular line would really help future learners who want to run the code locally avoid this confusion.

Thanks for an awesome educational experience otherwise!! fast.ai absolutely rocks!
