(For fastai 1.0.60.)
Quite a late answer, but I also needed to build a character-level LM, and I didn’t find any good code snippets.
Here’s what I came up with after digging around in `fastai/text/transform.py`.
This is my custom `CharTokenizer` class. It’s super simple, mostly copied from the `SpacyTokenizer` class. The character-level tokenization logic lives in the `tokenizer()` method, which you can customize.
```python
from fastai.text import *   # provides BaseTokenizer, Tokenizer, List, etc.

class CharTokenizer(BaseTokenizer):
    def __init__(self, lang:str='no_lang'):
        '''Needed to initialize BaseTokenizer correctly.'''
        super().__init__(lang=lang)

    def tokenizer(self, t:str) -> List[str]:
        '''Turns each character into a token. Replaces spaces with '_'.'''
        return list(t.replace(' ', '_'))
```
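As a quick sanity check before wiring it into the pipeline, you can call the tokenizer directly; the sample string below is just an illustration:

```python
tok = CharTokenizer()
print(tok.tokenizer('a cat'))   # ['a', '_', 'c', 'a', 't']
```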
To actually use this in the data block / `TextList` API, you need to wrap it in the right classes:
```python
# Not sure why there are so many levels of indirection here...
char_tokenize_processor = TokenizeProcessor(tokenizer=Tokenizer(tok_func=CharTokenizer),
                                            include_bos=False)

data_lm = (TextList
           .from_folder(path='data/',
                        processor=[OpenFileProcessor(), char_tokenize_processor, NumericalizeProcessor()])
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=128))
```
Lastly, you can verify the result with `data_lm.show_batch()`.
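If it helps, here is a rough sketch of what the next steps could look like; the `drop_mult`, learning rate, and epoch count are placeholder values, and I pass `pretrained=False` because the bundled AWD_LSTM weights are word-level and don’t apply to a character vocab:

```python
# Sanity check: the vocab should contain single characters plus
# fastai's special tokens (xxunk, xxpad, ...).
print(data_lm.vocab.itos)

# Train a character-level LM from scratch.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn.fit_one_cycle(1, 1e-2)
```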