Using SubwordTokenizer in TextDataLoaders

Hello,
I need to use subword tokenization for transfer learning with a German language model, since the pretrained model also uses subword tokenization.

I have seen the following example in lesson 10 of the 2020 fastai course:

def subword(sz):
    # Train a SentencePiece subword tokenizer with the given vocab size
    # on `txts` (a list of raw texts defined earlier in the lesson),
    # then show the first 40 tokens of one sample text `txt`
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])
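
In the lesson this helper is then called with different vocabulary sizes to compare the resulting tokenizations, along the lines of:

subword(1000)
subword(200)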

So I thought it would be possible to use such a tokenizer in TextDataLoaders, based on the fastai docs, like this:

TextDataLoaders.from_df(df, text_col='text', tok_tfm=SubwordTokenizer(25000), is_lm=True, valid_pct=0.1)

But unfortunately I get the following error:
Vocabulary size is smaller than required_chars. 29 vs 58. Increase vocab_size or decrease character_coverage with --character_coverage option.

So, what does this mean, and is there another way to use subword tokenization with fastai dataloaders?
I really need this for my German model.

Thank you very much :slight_smile:

Tokenizers are not easy, unfortunately… If you are using a pretrained model, then you should also use the same pretrained tokenizer that was used to train that model: you need the same vocab. In the example above it looks like you are training a new tokenizer from scratch, which may not be compatible with your pretrained model.
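
If your pretrained model shipped with a SentencePiece model file, a minimal sketch of reusing it could look like this. The file path and the pretrained_vocab variable are assumptions about what came with your model; fastai's SubwordTokenizer accepts an existing SentencePiece model via sp_model, and TextDataLoaders.from_df accepts a saved vocab via text_vocab:

from fastai.text.all import *

# Assumed: 'path/to/spm.model' is the SentencePiece model file shipped with
# the pretrained German model, and pretrained_vocab is the vocab saved with it
tok = SubwordTokenizer(lang='de', sp_model='path/to/spm.model')

dls = TextDataLoaders.from_df(df, text_col='text', tok_tfm=tok,
                              text_vocab=pretrained_vocab,
                              is_lm=True, valid_pct=0.1)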

There are good examples of subword tokenizers/models and transfer learning for non-English languages in the fastai NLP course, but that is unfortunately with fastai version 1…

After many tries I found a possible solution:

TextBlock.from_df('text', tok=SubwordTokenizer(lang='en', vocab_sz=1000))

So for anyone searching in the future: this is how to use fastai with a custom tokenizer. Note that vocab_sz must be passed as a keyword argument; in my snippet above, SubwordTokenizer(25000) passed 25000 positionally to the lang parameter, so no vocabulary size was actually set, which is presumably what triggered the error.
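
For a complete language-model pipeline, a sketch along these lines should work. Here df is assumed to be a DataFrame with the raw texts in a 'text' column as in my question, and the German vocab size is an assumption; the get_x/splitter pattern follows the fastai docs:

from fastai.text.all import *

# Assumed: df has the raw texts in a 'text' column
dls = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True,
                             tok=SubwordTokenizer(lang='de', vocab_sz=25000)),
    get_x=ColReader('text'),   # tokenized texts land in the 'text' column
    splitter=RandomSplitter(valid_pct=0.1)
).dataloaders(df, bs=64)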
