I want to try different tokenization methods for text classification.
from fastai.text.all import *
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
imdb_clas = DataBlock(
    blocks=(TextBlock.from_df('text', seq_len=72), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())
dls = imdb_clas.dataloaders(df, bs=64)
dls.show_batch(max_n=2)
Consider this sample code from the fastai documentation. If I want to tokenize the text with sub-word tokenization or character-based tokenization instead of the default tokenizer used by TextBlock, which part should I change, and how?
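In case it helps clarify what I'm asking, here is a minimal sketch of what I imagine the change might look like. I'm assuming that `TextBlock.from_df` accepts a `tok` argument for a custom tokenizer, that fastai tokenizers are callables taking a batch of texts and yielding a list of tokens per text, and that fastai provides a `SubwordTokenizer` wrapper around SentencePiece; please correct me if any of that is wrong.

```python
# Sketch of a character-level tokenizer, assuming fastai's tokenizer
# interface: a callable that receives a batch of texts and yields
# one list of tokens per text.
class CharTokenizer:
    def __call__(self, items):
        # Split each text into individual characters
        return (list(t) for t in items)

# Plugging it into the DataBlock would then (I assume) look like:
# imdb_clas = DataBlock(
#     blocks=(TextBlock.from_df('text', seq_len=72, tok=CharTokenizer()),
#             CategoryBlock),
#     get_x=ColReader('text'), get_y=ColReader('label'),
#     splitter=ColSplitter())

# And for sub-word tokenization, something like (again, an assumption):
# from fastai.text.all import SubwordTokenizer
# tok = SubwordTokenizer(vocab_sz=1000)
# TextBlock.from_df('text', seq_len=72, tok=tok)
```

Is passing a tokenizer this way the intended mechanism, or does the default tokenizer need to be replaced somewhere else in the pipeline?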