Training a language model to predict characters

Hi! I would like to use a language model to predict characters :grin:. My input data are words (one word per line in a txt file). I've decided to use TextList to store the values (see code below).

    data_lm = (TextList.from_df(df, cols="text")
               .split_by_rand_pct(0.1, seed=101)
               .label_for_lm()
               .databunch(bs=10, num_workers=0))

I know that I need custom preprocessing (including a custom tokenizer), but I don't know how to implement it. Has anyone tried something like this?

My main idea is to generate words with distribution learned from training words.
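To make the question concrete, here is a rough sketch of what I imagine a character-level tokenizer would look like. It is plain Python and does not import fastai. The class name `CharTokenizer` and the helpers `build_vocab` and `numericalize` are made up for illustration. The method names `tokenizer` and `add_special_cases` are meant to mirror fastai v1's `BaseTokenizer` interface, and the special tokens are the ones I believe fastai v1 uses; both are assumptions I have not verified against the library.

```python
from collections import Counter


class CharTokenizer:
    """Hypothetical tokenizer that splits a word into single characters.

    Assumption: the constructor argument and method names follow the
    interface of fastai v1's BaseTokenizer, so that it could be passed as
    Tokenizer(tok_func=CharTokenizer, pre_rules=[], post_rules=[]).
    That wiring is untested here.
    """

    def __init__(self, lang="en"):
        self.lang = lang

    def tokenizer(self, t):
        # "cat" -> ["c", "a", "t"]
        return list(t)

    def add_special_cases(self, toks):
        # Every character is its own token, so there is nothing to register.
        pass


def build_vocab(words, specials=("xxunk", "xxpad", "xxbos")):
    """Build index-to-string and string-to-index maps over characters."""
    tok = CharTokenizer()
    counts = Counter(ch for w in words for ch in tok.tokenizer(w))
    itos = list(specials) + sorted(counts)
    stoi = {s: i for i, s in enumerate(itos)}
    return itos, stoi


def numericalize(word, stoi):
    """Map a word to character ids, using the xxunk id for unseen characters."""
    unk = stoi["xxunk"]
    return [stoi.get(ch, unk) for ch in CharTokenizer().tokenizer(word)]


words = ["cat", "car", "cart"]
itos, stoi = build_vocab(words)
ids = numericalize("cat", stoi)
print(itos)
print(ids)
```

If the interface assumption holds, I guess the tokenizer would be passed to `TextList.from_df` through its `processor` argument, with the default pre- and post-processing rules switched off so that the characters are left untouched, but I am not sure this is the right way.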

For future readers, I had the same question and answered it in this thread: