NLP for 2 classes and language model

petrhrobar · January 2, 2023, 5:13pm

"I am trying to replicate an example from fastai in which Jeremy creates a dataloader for multi-level classification (see example here: Multi-target: Road to the Top, Part 4 | Kaggle). My goal is to use this approach for an NLP task, where I have a nested structure of labeled data.

I have successfully replicated the example and am able to numericalize the text and have two categories. However, what I am trying to do is create a model that classifies both categories and also predicts the next word.

The dataloader I am creating is not working as expected: it behaves like a plain language model dataloader. Each batch contains only the numericalized text as x and the same text shifted by one token as y, and the two category targets are missing. I’m not sure why this is happening."

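from fastai.text.all import *

# df_train: DataFrame with columns 'text_col', 'class_level1', 'class_level2'
# Intended layout: one text input, three targets (LM text plus two category levels)
imdb_lm_debug2 = DataBlock(
    blocks=(
        TextBlock.from_df(text_cols='text_col', is_lm=False, tok_text_col='x1', n_workers=3),  # classifier input
        TextBlock.from_df(text_cols='text_col', is_lm=True, tok_text_col='x2', n_workers=3),   # next-word target
        CategoryBlock,  # class_level1
        CategoryBlock,  # class_level2
    ),
    n_inp=1,
    get_x=ColReader('x1'),
    get_y=(ColReader('x2'), ColReader('class_level1'), ColReader('class_level2')),
    splitter=RandomSplitter(valid_pct=0),
)

dls_debug2 = imdb_lm_debug2.dataloaders(df_train, bs=1, seq_len=72)
dls_debug2.one_batch()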

This is what I get:
(LMTensorText([[   2,    7,   78,    7,   19,   37,   16,    8,    9,  127,   97,   99,
            14,    8,  380,   12,  881,  103,    0,   63,   12,    9,   94,  175,
            24, 1898,    0,  153,  358,   17,    9,   59,   93,   13,   40,   80,
            62,  192,   15,    9,    7,  142,   10,    8,   39, 1004,    9,   97,
           426,   14,  227,    9,   65,  114,   13,   65,   44,   11,  365,   27,
            53, 1248, 1692,    9,  862,   62,  352,  153,   64,   14,   46,   88]],
        device='cuda:0'),
 TensorText([[   7,   78,    7,   19,   37,   16,    8,    9,  127,   97,   99,   14,
             8,  380,   12,  881,  103,    0,   63,   12,    9,   94,  175,   24,
          1898,    0,  153,  358,   17,    9,   59,   93,   13,   40,   80,   62,
           192,   15,    9,    7,  142,   10,    8,   39, 1004,    9,   97,  426,
            14,  227,    9,   65,  114,   13,   65,   44,   11,  365,   27,   53,
          1248, 1692,    9,  862,   62,  352,  153,   64,   14,   46,   88,  961]],
        device='cuda:0'))
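
To make the x/y relationship in that batch explicit, here is a small plain-Python sketch of how a language-model batch pairs each window of tokens with the same window shifted one position to the right. It does not use fastai and is not meant to reproduce its dataloader internals; `lm_pairs` is a hypothetical helper written only for illustration, and the token ids are the first few values of x from the output above.

```python
def lm_pairs(tokens, seq_len):
    """Split a token stream into (x, y) windows where y is x shifted by one token."""
    pairs = []
    for start in range(0, len(tokens) - 1, seq_len):
        x = tokens[start:start + seq_len]
        y = tokens[start + 1:start + seq_len + 1]
        # Keep only full windows that have a next token for every position
        if len(x) == len(y):
            pairs.append((x, y))
    return pairs


# First token ids of x from the batch above
stream = [2, 7, 78, 7, 19, 37, 16, 8, 9]
pairs = lm_pairs(stream, seq_len=4)

for x, y in pairs:
    print("x:", x, "y:", y)
```

The first window pairs `[2, 7, 78, 7]` with `[7, 78, 7, 19]`, which matches the start of the two tensors in the batch. Nothing in this layout carries a category label, which is consistent with the batch containing only two tensors.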