Fastai.text test predictions

krasin · March 22, 2020, 7:13am

How do I make predictions on a test dataset with fastai.text? I tried learn.get_preds, but I get wrong results. Maybe the data are shuffled somehow.

First I train the model:

dls = TextDataLoaders.from_df(df, text_col='text', label_col='target', seq_len=36)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, 
                           metrics=skm_to_fastai(f1_score), seq_len=36)
learn.fine_tune(4, 1e-2)

Then I load a dataset (the same one in this example) and make predictions on it:

dl_test = learn.dls.test_dl(df, with_labels=True)
preds = learn.get_preds(dl=dl_test, with_decoded=True)
df['preds'] = preds[2] # I assume that `preds[1]` are the targets and `preds[2]` are the predicted labels

The result is close to random.

f1_score(df['target'], df['preds'])
0.398

If I apply learn.predict the results are good, but it is very slow.

df['preds'] = df['text'].map(lambda x: learn.predict(x)[0])

The score is also good when I compare the targets and predictions returned by get_preds with each other:

f1_score(preds[1], preds[2])
0.86

For information, here is the format of preds:

preds

(tensor([[0.8350, 0.1650],
         [0.8271, 0.1729],
         [0.7271, 0.2729],
         ...,
         [0.7816, 0.2184],
         [0.7872, 0.2128],
         [0.7755, 0.2245]]),
 TensorCategory([0, 0, 0,  ..., 0, 0, 0]),
 tensor([0, 0, 0,  ..., 0, 0, 0]))

RogerS49 · March 24, 2020, 6:11am

Should that not be preds?

krasin · March 24, 2020, 6:40am

Yes, it is preds. I corrected the post, thanks. But the question remains the same.

RogerS49 · March 24, 2020, 6:42am

What's in df.head() for both train and test?
Edit…
Sorry, I reread the post: the labels are in target.

Next question:

Where is the printout of the loss, etc., from training and fine-tuning?

krasin · March 24, 2020, 9:22am

I posted only the problematic part. The full notebook is here: https://www.paperspace.com/krasin/notebook/przpxr5ey
(playing with https://www.kaggle.com/c/nlp-getting-started data)

krasin · April 2, 2020, 7:25pm

@sgugger any idea what I am doing wrong with fastai.text test_dl usage?

sgugger · April 2, 2020, 7:29pm

You need to tokenize your dataframe with tokenize_df I’d say.

krasin · April 3, 2020, 7:17am

No change. Both the results and the labels in the output of learn.get_preds(...) are shuffled.

sgugger · April 3, 2020, 12:07pm

Yes, the dataloader gives you the results sorted by text length, to be memory-efficient.
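
To illustrate the point, here is a minimal pure-Python sketch (not fastai code; the texts, labels, and the `toy_model` function are made up, and the ascending sort is chosen only for illustration). Predictions come back in the length-sorted order, so they agree with the targets returned alongside them but not with the original row order:

```python
# Toy data: made-up texts and labels, for illustration only.
texts = ["aaaa", "b", "ccc", "dd", "eeeee"]
labels = [1, 0, 1, 0, 1]

def toy_model(text):
    # Hypothetical stand-in for a classifier: long texts -> 1, short texts -> 0.
    return 1 if len(text) >= 3 else 0

# Serve the samples sorted by length, as a length-sorting loader would.
order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
returned_preds = [toy_model(texts[i]) for i in order]
returned_targets = [labels[i] for i in order]

def accuracy(a, b):
    # Fraction of positions where the two sequences agree.
    return sum(x == y for x, y in zip(a, b)) / len(a)

aligned = accuracy(returned_targets, returned_preds)  # both in sorted order
misaligned = accuracy(labels, returned_preds)         # original order vs sorted order
print(aligned, misaligned)
```

This mirrors the symptom in the first post: comparing preds[1] with preds[2] scores well, while comparing df['target'] with preds[2] does not.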

krasin · April 3, 2020, 4:05pm

Thank you, that makes sense. Can we restore the original order somehow? I’ll try sorting the dataframe.
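
One way to restore the order by hand is to scatter each prediction back to its original row index. This is only a sketch, assuming the order in which the rows were served is known; the `order` list and the `restore_order` helper below are hypothetical, and how to obtain that order from a real dataloader is not shown here:

```python
def restore_order(sorted_values, order):
    """Undo a permutation: order[k] is the original row index of sorted_values[k]."""
    restored = [None] * len(sorted_values)
    for position, original_index in enumerate(order):
        restored[original_index] = sorted_values[position]
    return restored

# Made-up example: the rows were served in the order 1, 3, 2, 0, 4.
order = [1, 3, 2, 0, 4]
sorted_preds = [0, 0, 1, 1, 1]
restored_preds = restore_order(sorted_preds, order)
print(restored_preds)
```

After this step, restored_preds[i] belongs to row i of the original dataframe again.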

sgugger · April 3, 2020, 4:47pm

I added reorder=True to Learner.get_preds, so it will now be done by default.

krasin · April 3, 2020, 8:56pm

Now it works great.