"I am trying to replicate an example from fastai in which Jeremy creates a dataloader for multi-level classification (see example here: Multi-target: Road to the Top, Part 4 | Kaggle). My goal is to use this approach for an NLP task, where I have a nested structure of labeled data.
I have successfully replicated the example and am able to numericalize the text and have two categories. However, what I am trying to do is create a model that classifies both categories and also predicts the next word.
The dataloader I am creating does not work as expected. It behaves like a plain language-model dataloader: it returns only the numericalized text as x and the same text shifted by one token (i.e., including the next word) as y, with no category targets. I’m not sure why this is happening."
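To make the expected result concrete, here is a minimal sketch in plain Python of the batch layout that four blocks with n_inp=1 should give: one input followed by three targets. The values are made-up placeholders, not real fastai output.

```python
# Hypothetical placeholder batch: one element per block in the DataBlock.
n_inp = 1
batch = (
    [2, 7, 78, 7],    # x: numericalized text
    [7, 78, 7, 19],   # y1: the same text shifted by one token (next-word target)
    3,                # y2: class_level1 index (made-up value)
    11,               # y3: class_level2 index (made-up value)
)

# The first n_inp elements are inputs; everything after them is a target.
inputs, targets = batch[:n_inp], batch[n_inp:]
print(len(inputs), len(targets))
```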
from fastai.text.all import *

# df_train is a DataFrame with the columns 'text_col', 'class_level1' and 'class_level2'.
imdb_lm_debug2 = DataBlock(
    blocks=(
        TextBlock.from_df(text_cols='text_col', is_lm=False, tok_text_col='x1', n_workers=3),
        TextBlock.from_df(text_cols='text_col', is_lm=True, tok_text_col='x2', n_workers=3),
        CategoryBlock,
        CategoryBlock,
    ),
    n_inp=1,  # one input (x1); the remaining three blocks are targets
    get_x=ColReader('x1'),
    get_y=(ColReader('x2'), ColReader('class_level1'), ColReader('class_level2')),
    splitter=RandomSplitter(valid_pct=0),
)
dls_debug2 = imdb_lm_debug2.dataloaders(df_train, bs=1, seq_len=72)
dls_debug2.one_batch()
This is what I get:
(LMTensorText([[   2,    7,   78,    7,   19,   37,   16,    8,    9,  127,   97,   99,
                  14,    8,  380,   12,  881,  103,    0,   63,   12,    9,   94,  175,
                  24, 1898,    0,  153,  358,   17,    9,   59,   93,   13,   40,   80,
                  62,  192,   15,    9,    7,  142,   10,    8,   39, 1004,    9,   97,
                 426,   14,  227,    9,   65,  114,   13,   65,   44,   11,  365,   27,
                  53, 1248, 1692,    9,  862,   62,  352,  153,   64,   14,   46,   88]],
              device='cuda:0'),
 TensorText([[   7,   78,    7,   19,   37,   16,    8,    9,  127,   97,   99,   14,
                 8,  380,   12,  881,  103,    0,   63,   12,    9,   94,  175,   24,
              1898,    0,  153,  358,   17,    9,   59,   93,   13,   40,   80,   62,
               192,   15,    9,    7,  142,   10,    8,   39, 1004,    9,   97,  426,
                14,  227,    9,   65,  114,   13,   65,   44,   11,  365,   27,   53,
              1248, 1692,    9,  862,   62,  352,  153,   64,   14,   46,   88,  961]],
            device='cuda:0'))
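The batch is a 2-tuple, and the second tensor looks like the first one shifted left by one token, which is the usual language-model input/target pairing. A minimal check in plain Python, using only the first eight ids of each tensor from the output above:

```python
# First eight ids of x and y, copied from the batch printed above.
x = [2, 7, 78, 7, 19, 37, 16, 8]
y = [7, 78, 7, 19, 37, 16, 8, 9]

# In a language-model batch, y[i] is the token that follows x[i],
# so y without its last id should equal x without its first id.
is_shifted = y[:-1] == x[1:]
print(is_shifted)
```

So the two category targets are missing from the batch entirely, rather than being merged into y.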