Hello,
I need to use subword tokenization for transfer learning a German language model, as the pretrained model also uses subword tokenization.
I have seen the following example in lesson 10 of the 2020 fastai course:
def subword(sz):
    # train a SentencePiece subword model of size sz on txts,
    # then show the first 40 tokens of one example text txt
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])
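In the notebook, if I remember correctly, txts is a list of raw texts and txt a single example text, and the function is then called with different vocabulary sizes to compare the resulting tokenizations (the ▁ character marks word starts in SentencePiece output), e.g.:

subword(1000)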
So I thought it would be possible, based on the fastai docs, to use such a tokenizer in a TextDataLoaders like this:
TextDataLoaders.from_df(df, text_col='text', tok_tfm=SubwordTokenizer(25000), is_lm=True, valid_pct=0.1)
But unfortunately I get the following error:
Vocabulary size is smaller than required_chars. 29 vs 58. Increase vocab_size or decrease character_coverage with --character_coverage option.
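For completeness, here is a self-contained version of what I am running (the CSV path is just a placeholder for my German corpus; the rest is exactly the call above):

from fastai.text.all import *
import pandas as pd

# my German corpus; the 'text' column holds the raw texts
df = pd.read_csv('german_corpus.csv')  # placeholder path

dls = TextDataLoaders.from_df(
    df, text_col='text',
    tok_tfm=SubwordTokenizer(25000),  # this call triggers the error
    is_lm=True, valid_pct=0.1)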
So, what does this error mean, and is there another way to use subword tokenization with the fastai DataLoaders?
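One guess from looking at the SentencePieceTokenizer signature: its first positional argument seems to be lang, not vocab_sz, so my call above may have set lang=25000 and left vocab_sz at its default. If that is the cause, passing explicit keywords (and 'de' for German) might already fix it, but I have not verified this:

tok = SubwordTokenizer(lang='de', vocab_sz=25000)  # explicit keywords instead of positional
dls = TextDataLoaders.from_df(df, text_col='text', tok_tfm=tok, is_lm=True, valid_pct=0.1)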
In any case, I really need this for my German model.
Thank you very much