Character-level language model

kennysong · April 7, 2020, 9:48am

(For fastai 1.0.60.)

Quite a late answer, but I also needed to build a character-level LM, and I didn’t find any good code snippets.

Here’s what I came up with after digging around in fastai/text/transform.py.

This is my custom CharTokenizer class. It’s super simple, mostly copied from the SpacyTokenizer class. The character-level tokenization logic is in the tokenizer() method, which you can customize.

from typing import List
from fastai.text import *  # BaseTokenizer, Tokenizer, TokenizeProcessor, TextList, ...

class CharTokenizer(BaseTokenizer):
    def __init__(self, lang:str='no_lang'):
        '''Needed to initialize BaseTokenizer correctly.'''
        super().__init__(lang=lang)

    def tokenizer(self, t:str) -> List[str]:
        '''Turns each character into a token. Replaces spaces with '_'.'''
        return list(t.replace(' ', '_'))
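
If you want to sanity-check the tokenization logic without loading fastai, here's a minimal standalone sketch of the same transformation. The names char_tokenize and char_detokenize are made up for this illustration (they are not part of fastai), and the inverse function is just one plausible way to turn tokens back into text.

```python
from typing import List

def char_tokenize(text: str) -> List[str]:
    """Same logic as CharTokenizer.tokenizer: one token per character, spaces become '_'."""
    return list(text.replace(' ', '_'))

def char_detokenize(tokens: List[str]) -> str:
    """Hypothetical inverse: join the tokens and turn '_' back into spaces."""
    return ''.join(tokens).replace('_', ' ')

tokens = char_tokenize('hi there')
print(tokens)                   # ['h', 'i', '_', 't', 'h', 'e', 'r', 'e']
print(char_detokenize(tokens))  # hi there

# Caveat: a literal underscore in the input can't be told apart from a space afterwards.
print(char_detokenize(char_tokenize('snake_case')))  # snake case
```

If your corpus contains real underscores, you may want to pick a different placeholder character in tokenizer().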

To actually use this in the DataBlocks / TextList API, you need to wrap it in the right classes:

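# Not sure why there are so many levels of indirection here...
# Note: pass the CharTokenizer class itself, not an instance; Tokenizer instantiates it internally.
char_tokenize_processor = TokenizeProcessor(tokenizer=Tokenizer(tok_func=CharTokenizer), include_bos=False)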

data_lm = (TextList
              .from_folder(path='data/', 
                           processor=[OpenFileProcessor(), char_tokenize_processor, NumericalizeProcessor()])
              .split_by_rand_pct(0.1)
              .label_for_lm()
              .databunch(bs=128))

Finally, you can check that the text was tokenized into characters by calling data_lm.show_batch().
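
For intuition about what the tokenize and numericalize steps produce, here's a rough pure-Python sketch of the idea: split text into characters, build a vocabulary, and map each character to an integer id. This is not fastai's implementation. build_vocab and numericalize are hypothetical helpers, and the real NumericalizeProcessor also handles things like extra special tokens and vocabulary size limits.

```python
from typing import Dict, List

UNK = 'xxunk'  # fastai's unknown-token string, used here only as a placeholder at index 0

def build_vocab(token_lists: List[List[str]]) -> List[str]:
    """Hypothetical helper: unknown token first, then every seen character in sorted order."""
    seen = {tok for toks in token_lists for tok in toks}
    return [UNK] + sorted(seen)

def numericalize(tokens: List[str], stoi: Dict[str, int]) -> List[int]:
    """Map each token to its id; characters missing from the vocab map to index 0 (UNK)."""
    return [stoi.get(tok, 0) for tok in tokens]

texts = ['ab a', 'ba']
token_lists = [list(t.replace(' ', '_')) for t in texts]  # same logic as CharTokenizer

itos = build_vocab(token_lists)
stoi = {tok: i for i, tok in enumerate(itos)}

print(itos)                                # ['xxunk', '_', 'a', 'b']
print(numericalize(token_lists[0], stoi))  # [2, 3, 1, 2]
print(numericalize(list('az'), stoi))      # [2, 0]
```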