Character-level language model

I’m attempting to create a character-level language model from wikitext-103, but I’m having a bit of trouble with the tokenization. At the moment I’m making a ‘custom’ Tokenizer, but it’s not actually…tokenizing. If I grab one row from the train df and, like in the docs, run tokenizer.process_text(example_text, tok), it runs with no problem. But when I pass it in to create a TextList:
data_lm = (TextList.from_csv(path, 'train.csv', tokenizer=tok, header=None)
           .random_split_by_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))
data_lm.show_batch()

I get what appears to be normally tokenized text (in the source I can see it just returns a ' '.join() on the tokenized list). What am I missing? My tokenizer is extremely simple:
class CharTokenizer(BaseTokenizer):
    def __init__(self, lang:str):
        self.lang = lang

    def tokenizer(self, t:str) -> List[str]:
        return list(t.replace('<unk>', '').replace(' ', ' ~ '))

    def add_special_cases(self, toks:Collection[str]):
        pass
since I just need to convert each string to a list of characters and replace the spaces with some placeholder character to preserve where the spaces are located in the original string. I’m sure there’s some simple thing I’m missing or misunderstanding about the data block API, and if anyone has any guidance I would appreciate it!
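For what it’s worth, calling the tokenizer directly does behave the way I expect (the example string here is just an illustration):

tok = CharTokenizer('en')
tok.tokenizer('the cat')
# -> ['t', 'h', 'e', ' ', '~', ' ', 'c', 'a', 't']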

Hey, I know this question is old but I just came across it and it might help someone else.

When you’re using the data block API with custom tokenization, you also need to customize the preprocessing (because, like you say, by default it just does a ' '.join() on the tokens).

I’ve had success with something like:

import string

itos = [UNK, BOS] + list(string.ascii_lowercase + " -'")
vocab = Vocab(itos)
tokenizer = Tokenizer(CharTokenizer, pre_rules=[], post_rules=[])
# these processors get passed to the TextList below
processors = [TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor(vocab=vocab)]

data_lm = (TextList
           .from_df(df, processor=processors)
           .random_split_by_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs) )
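You can then sanity-check the result, for example:

data_lm.show_batch()             # each token should now be a single character
print(data_lm.vocab.itos[:10])   # the vocab should list individual characters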

I’ve got an example notebook with all the details.

I’ve been working on this problem and this ended up being helpful. The critical thing that I missed, which was incredibly frustrating, was that the characters passed to the processor must be in a list, not a single string.
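In other words (a small illustration, using a made-up character set):

chars = "abc '"
itos_right = [UNK, BOS] + list(chars)   # one token per character: ['xxunk', 'xxbos', 'a', 'b', 'c', ' ', "'"]
itos_wrong = [UNK, BOS] + [chars]       # a single multi-character "token": ['xxunk', 'xxbos', "abc '"]
vocab = Vocab(itos_right)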

Hopefully others find this useful too.

Any new ideas? I’ve tried many ways to deal with this without any progress… My idea is to use TextList, but first I have to pass in my own Tokenizer/Processor.

(For fastai 1.0.60.)

Quite a late answer, but I also needed to build a character-level LM, and I didn’t find any good code snippets.

Here’s what I came up with after digging around in fastai/text/transform.py.

This is my custom CharTokenizer class. It’s super simple, mostly copied from the SpacyTokenizer class. The character-level tokenization logic is in the tokenizer() method, which you can customize.

from fastai.text import *   # for BaseTokenizer, Tokenizer, TokenizeProcessor, etc.

class CharTokenizer(BaseTokenizer):
    def __init__(self, lang:str='no_lang'):
        '''Needed to initialize BaseTokenizer correctly.'''
        super().__init__(lang=lang)

    def tokenizer(self, t:str) -> List[str]:
        '''Turns each character into a token. Replaces spaces with '_'.'''
        return list(t.replace(' ', '_'))
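A quick sanity check of the tokenizer on its own (the example string is arbitrary):

CharTokenizer().tokenizer('a cat')
# -> ['a', '_', 'c', 'a', 't']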

To actually use this in the DataBlocks / TextList API, you need to wrap it in the right classes:

# Not sure why there's so many levels of indirection here...
char_tokenize_processor = TokenizeProcessor(tokenizer=Tokenizer(tok_func=CharTokenizer), include_bos=False)

data_lm = (TextList
              .from_folder(path='data/', 
                           processor=[OpenFileProcessor(), char_tokenize_processor, NumericalizeProcessor()])
              .split_by_rand_pct(0.1)
              .label_for_lm()
              .databunch(bs=128))

Lastly, you can verify the result with data_lm.show_batch().
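From there, a minimal training sketch might look like this (the hyperparameters are just placeholders, and pretrained=False because a character vocab doesn’t match any pretrained weights):

learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)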

Has anybody trained a character-level NLP model using custom models or a GRU with fastai?
The standard LSTM is sooo slow for my problem!