I want to try different tokenization methods for text classification.
from fastai.text.all import *
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
imdb_clas = DataBlock(
    blocks=(TextBlock.from_df('text', seq_len=72), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())
dls = imdb_clas.dataloaders(df, bs=64)
dls.show_batch(max_n=2)
Consider this sample code from the fastai documentation. If I want to tokenize the text with sub-word tokenization or character-based tokenization instead of the default tokenizer used by TextBlock, which part should I change, and how?
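In case it helps clarify what I'm asking, here is a minimal sketch of what I imagine the change might look like. I'm assuming that `TextBlock.from_df` accepts a `tok` argument for a custom tokenizer, that fastai tokenizers are callables taking a batch of texts and yielding a list of tokens per text, and that fastai provides a `SubwordTokenizer` wrapper around SentencePiece; please correct me if any of that is wrong.

```python
# Sketch of a character-level tokenizer, assuming fastai's tokenizer
# interface: a callable that receives a batch of texts and yields
# one list of tokens per text.
class CharTokenizer:
    def __call__(self, items):
        # Split each text into individual characters
        return (list(t) for t in items)

# Plugging it into the DataBlock would then (I assume) look like:
# imdb_clas = DataBlock(
#     blocks=(TextBlock.from_df('text', seq_len=72, tok=CharTokenizer()),
#             CategoryBlock),
#     get_x=ColReader('text'), get_y=ColReader('label'),
#     splitter=ColSplitter())

# And for sub-word tokenization, something like (again, an assumption):
# from fastai.text.all import SubwordTokenizer
# tok = SubwordTokenizer(vocab_sz=1000)
# TextBlock.from_df('text', seq_len=72, tok=tok)
```

Is passing a tokenizer this way the intended mechanism, or does the default tokenizer need to be replaced somewhere else in the pipeline?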