I'm working through the NLP notebook from Chapter 10 of the book, but fastai raises an error when I pass lang='zh':
scn = WordTokenizer(lang='zh')
The error message is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 scn = WordTokenizer(lang='zh')
File D:\Program Files\Python310\lib\site-packages\fastai\text\core.py:122, in SpacyTokenizer.__init__(self, lang, special_toks, buf_sz)
120 self.special_toks = ifnone(special_toks, defaults.text_spec_tok)
121 nlp = spacy.blank(lang)
--> 122 for w in self.special_toks: nlp.tokenizer.add_special_case(w, [{ORTH: w}])
123 self.pipe,self.buf_sz = nlp.pipe,buf_sz
AttributeError: 'ChineseTokenizer' object has no attribute 'add_special_case'
Is this because fastai hasn't added support for Chinese yet?
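From the traceback, the failure seems to be in fastai's loop that registers its special tokens (xxbos etc.) on the spaCy tokenizer: the Chinese tokenizer object simply doesn't expose an add_special_case method the way the English one does. Here is a minimal sketch of that situation, using stand-in classes (EnglishLikeTokenizer, ChineseLikeTokenizer, register_special_toks are all hypothetical names, not real fastai or spaCy API) and a guarded version of the loop that skips tokenizers lacking the method instead of raising:

```python
class EnglishLikeTokenizer:
    """Stand-in for a spaCy tokenizer that supports special cases."""
    def __init__(self):
        self.special_cases = {}

    def add_special_case(self, word, rules):
        self.special_cases[word] = rules


class ChineseLikeTokenizer:
    """Stand-in for spaCy's ChineseTokenizer: no add_special_case,
    hence the AttributeError in fastai's loop."""
    pass


def register_special_toks(tokenizer, special_toks):
    """Guarded version of fastai's loop: return False (skip) when the
    tokenizer doesn't support special cases, instead of raising."""
    if not hasattr(tokenizer, "add_special_case"):
        return False
    for w in special_toks:
        tokenizer.add_special_case(w, [{"ORTH": w}])
    return True


# The English-like tokenizer accepts the special tokens...
assert register_special_toks(EnglishLikeTokenizer(), ["xxbos"]) is True
# ...while the Chinese-like one is skipped rather than crashing.
assert register_special_toks(ChineseLikeTokenizer(), ["xxbos"]) is False
```

If the guard were in fastai's SpacyTokenizer.__init__, WordTokenizer(lang='zh') would construct (just without special-token handling); I don't know whether that is the intended fix or whether Chinese needs a different tokenizer class entirely.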