I'm attempting to create a char-level LM from wikitext-103, but I'm having a bit of trouble with the tokenization. At the moment I'm making a 'custom' `Tokenizer`, but it's not actually… tokenizing. If I grab one row from the train df like in the docs and run `tokenizer.process_text(example_text, tok)`, it runs no problem. But when it's passed in to create a `TextList`:

```python
data_lm = (TextList.from_csv(path, 'train.csv', tokenizer=tok, header=None)
           .random_split_by_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))
data_lm.show_batch()
```
I get what appears to be normally word-tokenized text (since in the source I can see `show_batch` is just returning a `' '.join()` on the tokenized list). What am I missing? My tokenizer is extremely simple:

```python
class CharTokenizer(BaseTokenizer):
    def __init__(self, lang:str):
        self.lang = lang

    def tokenizer(self, t:str) -> List[str]:
        return list(t.replace('<unk>', '').replace(' ', ' ~ '))

    def add_special_cases(self, toks:Collection[str]):
        pass
```
as I just need to convert the strings to a list of characters and replace the spaces with some placeholder char to preserve where the spaces are located in the original string. I'm sure there's some simple thing I'm missing or misunderstanding about the data block API, and if anyone has any guidance I would appreciate it!
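For reference, here is what that tokenizer logic produces when run on its own (plain Python, no fastai) — note that the padding spaces around `~` end up as tokens themselves:

```python
# The tokenizer logic from the class above, isolated for inspection.
def char_tok(t):
    return list(t.replace('<unk>', '').replace(' ', ' ~ '))

print(char_tok('a b'))  # → ['a', ' ', '~', ' ', 'b']
```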
I've been working on this problem and this thread ended up being helpful. The critical thing that I missed, which was incredibly frustrating, was that the characters passed to the processor must be in a list, not a string.
This is my custom CharTokenizer class. It’s super simple, mostly copied from the SpacyTokenizer class. The character-level tokenization logic is in the tokenizer() method, which you can customize.
```python
class CharTokenizer(BaseTokenizer):
    def __init__(self, lang:str='no_lang'):
        '''Needed to initialize BaseTokenizer correctly.'''
        self.lang = lang

    def tokenizer(self, t:str) -> List[str]:
        '''Turns each character into a token. Replaces spaces with '_'.'''
        return list(t.replace(' ', '_'))
```
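A quick sanity check of the character-splitting logic on its own (plain Python, no fastai needed):

```python
# Same logic as CharTokenizer.tokenizer(), runnable standalone.
def char_tokenizer(t):
    return list(t.replace(' ', '_'))

print(char_tokenizer('a cat'))  # → ['a', '_', 'c', 'a', 't']
```

Crucially, this returns a list of single-character strings, which is what the downstream processor expects.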
To actually use this in the DataBlocks / TextList API, you need to wrap it in the right classes:
```python
# Not sure why there are so many levels of indirection here...
char_tokenize_processor = TokenizeProcessor(tokenizer=Tokenizer(tok_func=CharTokenizer),
                                            include_bos=False)

data_lm = (TextList.from_csv(path, 'train.csv',
                             processor=[OpenFileProcessor(), char_tokenize_processor,
                                        NumericalizeProcessor()])
           .random_split_by_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))
```
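As a rough, fastai-free sketch of what the tokenize and numericalize processors are doing under the hood (the names and vocab-building here are illustrative, not fastai's actual implementation):

```python
# Tokenize each text into characters, then numericalize against a vocab.
def char_tokenize(t):
    return list(t.replace(' ', '_'))

texts = ["the cat", "the hat"]
tokenized = [char_tokenize(t) for t in texts]  # lists of chars, not strings

# Build a vocab from the observed tokens and map each token to an int id.
itos = sorted({tok for toks in tokenized for tok in toks})
stoi = {tok: i for i, tok in enumerate(itos)}
ids = [[stoi[tok] for tok in toks] for toks in tokenized]
```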