I'm attempting to create a char-level LM from wikitext-103, but I'm having a bit of trouble with the tokenization. At the moment I'm making a 'custom' `Tokenizer`, but it's not actually… tokenizing. If I grab one row from the train df like in the docs and run `tokenizer.process_text(example_text, tok)`, it runs no problem. But when it's passed in to create a `TextList`:

```python
data_lm = (TextList.from_csv(path, 'train.csv', tokenizer=tok, header=None)
           .random_split_by_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))
data_lm.show_batch()
```
I get what appears to be normally word-tokenized text (since in the source I can see `show_batch` is just returning a `' '.join()` on the tokenized list). What am I missing? My tokenizer is extremely simple:

```python
class CharTokenizer(BaseTokenizer):
    def __init__(self, lang:str):
        self.lang = lang

    def tokenizer(self, t:str) -> List[str]:
        return list(t.replace('<unk>', '').replace(' ', ' ~ '))

    def add_special_cases(self, toks:Collection[str]):
        pass
```
as I just need to convert the strings to a list of characters and replace the spaces with some placeholder char to preserve where the spaces are located in the original string. I'm sure there's some simple thing I'm missing or misunderstanding about the data block API, and if anyone has any guidance I would appreciate it!
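For reference, here is what that tokenizer logic produces when run on its own (plain Python, no fastai) — note that the padding spaces around `~` end up as tokens themselves:

```python
# The tokenizer logic from the class above, isolated for inspection.
def char_tok(t):
    return list(t.replace('<unk>', '').replace(' ', ' ~ '))

print(char_tok('a b'))  # → ['a', ' ', '~', ' ', 'b']
```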
I've been working on this problem and this thread ended up being helpful. The critical thing that I missed, which was incredibly frustrating, was that the characters passed to the processor must be in a list, not a string.
This is my custom CharTokenizer class. It’s super simple, mostly copied from the SpacyTokenizer class. The character-level tokenization logic is in the tokenizer() method, which you can customize.
```python
class CharTokenizer(BaseTokenizer):
    def __init__(self, lang:str='no_lang'):
        '''Needed to initialize BaseTokenizer correctly.'''
        self.lang = lang

    def tokenizer(self, t:str) -> List[str]:
        '''Turns each character into a token. Replaces spaces with '_'.'''
        return list(t.replace(' ', '_'))
```
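A quick sanity check of the character-splitting logic on its own (plain Python, no fastai needed):

```python
# Same logic as CharTokenizer.tokenizer(), runnable standalone.
def char_tokenizer(t):
    return list(t.replace(' ', '_'))

print(char_tokenizer('a cat'))  # → ['a', '_', 'c', 'a', 't']
```

Crucially, this returns a list of single-character strings, which is what the downstream processor expects.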
To actually use this in the DataBlocks / TextList API, you need to wrap it in the right classes:
```python
# Not sure why there are so many levels of indirection here...
char_tokenize_processor = TokenizeProcessor(tokenizer=Tokenizer(tok_func=CharTokenizer),
                                            include_bos=False)

data_lm = (TextList.from_csv(path, 'train.csv',
                             processor=[OpenFileProcessor(), char_tokenize_processor,
                                        NumericalizeProcessor()])
           .random_split_by_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))
```
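As a rough, fastai-free sketch of what the tokenize and numericalize processors are doing under the hood (the names and vocab-building here are illustrative, not fastai's actual implementation):

```python
# Tokenize each text into characters, then numericalize against a vocab.
def char_tokenize(t):
    return list(t.replace(' ', '_'))

texts = ["the cat", "the hat"]
tokenized = [char_tokenize(t) for t in texts]  # lists of chars, not strings

# Build a vocab from the observed tokens and map each token to an int id.
itos = sorted({tok for toks in tokenized for tok in toks})
stoi = {tok: i for i, tok in enumerate(itos)}
ids = [[stoi[tok] for tok in toks] for toks in tokenized]
```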