NLP preds on test dataset issues

I have an NLP test dataset that I’m trying to run inference on. I have a language model, and learn.predict works on a single record. There are two things I can’t seem to resolve when running it on a test set:

  • How to not split it
  • How to keep it ordered

    test_db = DataBlock(
        blocks=(TextBlock.from_df("text", seq_len=72, vocab=langmodel_lm.vocab),CategoryBlock),
        get_x=ColReader("text"),
        get_y=ColReader("label"),
        splitter=FuncSplitter(lambda i: True))

    test_dl  = test_db.dataloaders(test_df, bs=128, seq_len=80, shuffle_train=False)

    # preds,targs = 
    preds = learn.get_preds(dl=test_dl, with_input=True, reorder=False, with_decoded=True)
    test_df['preds'] = preds[2]
    test_df['targets'] = preds[1]

There are two issues:

  1. It errors if reorder=False, so I can only seem to get randomly ordered results.
  2. Writing the predictions back fails because the lengths differ. For a dataset of 1000 records, it gets split 996 to 4.

Hi Scott,

For problem 1, you can call test_dl.get_idxs() and use the result to sort your predictions back into the original order. Unfortunately, it seems that fastai v2 has no easier way to get sorted predictions for NLP.
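As a rough sketch of that idea: assuming idxs is the list returned by test_dl.get_idxs() (the order in which the loader yielded the rows, with the loader not shuffled between calls) and preds holds one prediction per row in that same order, each prediction can be put back at its original position. The helper name restore_order and the sample values are hypothetical and only for illustration.

```python
def restore_order(preds, idxs):
    """Rearrange preds so that position i holds the prediction for original row i.

    Assumes idxs[k] is the original row index of the k-th prediction.
    """
    if len(preds) != len(idxs):
        raise ValueError(f"got {len(preds)} predictions for {len(idxs)} indices")
    restored = [None] * len(idxs)
    for position, original_index in enumerate(idxs):
        restored[original_index] = preds[position]
    return restored


# Hypothetical example: the loader yielded rows 2, 0, 3, 1, in that order.
idxs = [2, 0, 3, 1]
preds = ["pred_for_row2", "pred_for_row0", "pred_for_row3", "pred_for_row1"]
ordered = restore_order(preds, idxs)
```

If the predictions are a tensor rather than a list, indexing with the argsort of the indices (something like preds[tensor(idxs).argsort()]) should give the same result.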

I didn’t really understand the second issue. Could you describe the problem in a bit more detail?

You’re not making a test set here; you’re making entirely new DataLoaders to train on. Instead, take your existing DataLoaders and use dl = dls.test_dl(test_df)

Something like this @muellerzr?

dev_df = pd.read_csv(path/'dev.csv')
train_df = pd.read_csv(path/'train.csv')
test_df = pd.read_csv(path/'test.csv')

train_db = DataBlock(
    blocks=(TextBlock.from_df("text", seq_len=72, vocab=langmodel_lm.vocab),CategoryBlock),
    get_x=ColReader("text"),
    get_y=ColReader("label"),
    splitter=RandomSplitter(0.1))

train_dl = train_db.dataloaders(train_df, bs=128, seq_len=80)
train_dl.test_dl(test_df)

preds = learn.get_preds(dl=train_dl.test_dl, with_input=True, reorder=True, with_decoded=True)
test_df['preds'] = preds[2]
test_df['targets'] = preds[1]

This errors with:
AttributeError: 'TextLearner' object has no attribute 'pbar'

No. If you’ve already got a trained Learner (which I don’t see being created here), it would be:

test_dl = learn.dls.test_dl(test_df)
preds = learn.get_preds(dl=test_dl)

If you can’t do the above, then a step was skipped along the way. This should come after your model is trained. Even after export (learn.export), you should be able to run this.

Awesome, thanks, that seems to work.

Matthew SF Choo also posted a Medium article showing a bit more work. I copied his functions and, with some slight changes, have been using them to run inference with my models. The article is called "Making NLP predictions on new datasets using Fast.ai".

My version is here on GitHub.