It took me an hour to figure this out, but the underscore used in the Kaggle notebook is not actually an underscore. It’s a character called “lower one eighth block”, Unicode U+2581. The author has used it in the notebook because it is the character SentencePiece uses when tokenizing your text: it prefixes the vocabulary keys for tokens that had a space before them.
From this page: GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.
SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol “▁” (U+2581) as follows.
What this results in is students trying the code on their own laptops, manually entering the line
tokz.vocab['_of']
and getting very confused when they get a key error on their local machine.
Here are the two underscores next to each other for comparison: ▁_
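If you want to check which of the two characters you have in your own code, here is a small sketch using only Python's standard library. It just prints the code point and official Unicode name of each character; nothing in it comes from the notebook itself.

```python
import unicodedata

# The SentencePiece meta symbol and the ordinary ASCII underscore.
lookalikes = ["\u2581", "_"]

# Map each character to its code point and official Unicode name.
names = {ch: unicodedata.name(ch) for ch in lookalikes}

for ch, name in names.items():
    print(f"{ch!r}  U+{ord(ch):04X}  {name}")
```

The first line should report LOWER ONE EIGHTH BLOCK and the second LOW LINE, which is the Unicode name for the plain underscore.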
So if you check whether a key exists, you can’t just type an underscore in your source code. You have to use the same Unicode character, for example by copy-pasting it or by writing the escape '\u2581' in the string.
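Here is a rough sketch of the difference. To keep it runnable without downloading a tokenizer, `vocab` is a toy dictionary I made up to stand in for `tokz.vocab`, and the ids in it are invented. `SP_SPACE` and `to_sp_key` are hypothetical names of mine, not part of SentencePiece or the notebook.

```python
# U+2581, the meta symbol SentencePiece uses to mark a preceding space.
SP_SPACE = "\u2581"

# Toy stand-in for tokz.vocab; the token ids here are invented.
vocab = {SP_SPACE + "of": 10, "of": 11}


def to_sp_key(typed: str) -> str:
    """Hypothetical helper: swap a leading ASCII underscore for U+2581."""
    if typed.startswith("_"):
        return SP_SPACE + typed[1:]
    return typed


ascii_hit = "_of" in vocab              # plain underscore: key not present
escaped_hit = "\u2581of" in vocab       # escape sequence: key is present
converted_hit = to_sp_key("_of") in vocab

print(ascii_hit, escaped_hit, converted_hit)
```

With the notebook's tokenizer, the equivalent lookup would presumably be tokz.vocab['\u2581of'] rather than the version typed with a plain underscore.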
I think adding a comment on that particular line in the notebook would really help future learners who want to run the code locally avoid this confusion.
Thanks for an awesome educational experience otherwise!! fast.ai absolutely rocks!
Mark.