NLP preds on test dataset issues

scradley · September 8, 2020, 7:20am

I have an NLP test dataset that I’m trying to run inference on. I have a language model, and it works if I call learn.predict on a single record. There are two things I can’t seem to resolve when trying to run it on a test set:

1. How to not split it
2. How to keep it ordered

    test_db = DataBlock(
        blocks=(TextBlock.from_df("text", seq_len=72, vocab=langmodel_lm.vocab),CategoryBlock),
        get_x=ColReader("text"),
        get_y=ColReader("label"),
        splitter=FuncSplitter(lambda i: True))

    test_dl  = test_db.dataloaders(test_df, bs=128, seq_len=80, shuffle_train=False)

    # preds,targs = 
    preds = learn.get_preds(dl=test_dl, with_input=True, reorder=False, with_decoded=True)
    test_df['preds'] = preds[2]
    test_df['targets'] = preds[1]

There are two issues:

1. It errors if reorder=False, so I can only seem to get randomly ordered results.
2. Writing the predictions back fails because the lengths differ. For a dataset of 1000 records, it splits it 996 to 4.

stefan-ai · September 8, 2020, 9:15am

Hi Scott,

For problem 1, you can call test_dl.get_idxs() and use the result to sort your predictions back into the original order. Unfortunately, it seems that fastai v2 has no easier way to get sorted predictions for NLP.
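As a rough illustration of that reordering idea, here is a minimal sketch in plain Python. It assumes the index list says, for each prediction in the order it came out, which original row it belongs to, and that it is the same index list that was used during the prediction pass (with shuffling on, a fresh call could return a different order). The name restore_order and the sample data are made up for illustration and are not part of fastai.

```python
def restore_order(served_idxs, preds):
    """Rearrange preds so that position i holds the prediction for original row i.

    served_idxs[k] is assumed to be the original row index of the k-th
    prediction, e.g. a list built from test_dl.get_idxs().
    """
    if len(served_idxs) != len(preds):
        raise ValueError("index list and predictions must have the same length")
    restored = [None] * len(preds)
    for pred, original_row in zip(preds, served_idxs):
        restored[original_row] = pred
    return restored


# Hypothetical data: rows were served in the order 2, 0, 3, 1.
served_idxs = [2, 0, 3, 1]
preds = ["pred_for_row2", "pred_for_row0", "pred_for_row3", "pred_for_row1"]
restored = restore_order(served_idxs, preds)
print(restored)
```

With tensors, the same thing is usually done by indexing the predictions with the argsort of the index list.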

I didn’t really understand the second issue. Could you describe the problem in a bit more detail?

muellerzr · September 8, 2020, 12:18pm

You’re not making a test set here; you’re making entirely new DataLoaders to train on. You should take your existing DataLoaders and use dl = dls.test_dl(test_df).

scradley · September 8, 2020, 8:42pm

Something like this @muellerzr?

    dev_df = pd.read_csv(path/'dev.csv')
    train_df = pd.read_csv(path/'train.csv')
    test_df = pd.read_csv(path/'test.csv')

    train_db = DataBlock(
        blocks=(TextBlock.from_df("text", seq_len=72, vocab=langmodel_lm.vocab),CategoryBlock),
        get_x=ColReader("text"),
        get_y=ColReader("label"),
        splitter=RandomSplitter(0.1))

    train_dl = train_db.dataloaders(train_df, bs=128, seq_len=80)
    train_dl.test_dl(test_df)

    preds = learn.get_preds(dl=train_dl.test_dl, with_input=True, reorder=True, with_decoded=True)
    test_df['preds'] = preds[2]
    test_df['targets'] = preds[1]

This errors with
AttributeError: 'TextLearner' object has no attribute 'pbar'

muellerzr · September 8, 2020, 8:45pm

No. If you’ve already got a trained Learner (which I don’t see created here), it would be:

    test_dl = learn.dls.test_dl(test_df)
    preds = learn.get_preds(dl=test_dl)

If you can’t do the above, then a step was skipped along the way. This should come after your model is trained. Even after exporting (learn.export), you should be able to run this.
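To get from those predictions back to a column on test_df, as in the original question, the per-class probabilities still need to be turned into labels. Below is a minimal sketch of that step in plain Python, not fastai's own decoding. It assumes the predictions have been converted to a list of rows (for example with .tolist() on the returned tensor) and that class_labels lists the categories in the model's class order. For a TextBlock plus CategoryBlock setup that list may be available as learn.dls.vocab[1], but treat that as an assumption to check. The name decode_predictions and the sample values are hypothetical.

```python
def decode_predictions(probs, class_labels):
    """Map each row of class probabilities to the label with the highest score.

    probs: list of rows, one per test record, one probability per class.
    class_labels: labels in the same order as the model's output classes
    (an assumption; verify against your own DataLoaders).
    """
    decoded = []
    for row in probs:
        best_class = max(range(len(row)), key=row.__getitem__)
        decoded.append(class_labels[best_class])
    return decoded


# Hypothetical probabilities for three test records and two classes.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
class_labels = ["neg", "pos"]
labels = decode_predictions(probs, class_labels)
print(labels)
```

Because the test DataLoader produces one prediction per row of test_df, the resulting list should have the same length as the frame and could then be assigned with something like test_df['preds'] = labels.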

scradley · September 8, 2020, 9:04pm

Awesome, thanks, that seems to work.

daveramseymusic · November 22, 2021, 4:24pm

Matthew SF Choo also posted a Medium article that shows a bit more of the work involved. I copied his functions and, with some slight changes, have been using them to run inference with my models. The article is called "Making NLP predictions on new datasets using Fast.ai".

My version is here on GitHub.