How do I make predictions on a test dataset with fastai.text? I tried learn.get_preds, but I get wrong results. Maybe the data are shuffled somehow.

First I train the model:

from fastai.text.all import *
from sklearn.metrics import f1_score

dls = TextDataLoaders.from_df(df, text_col='text', label_col='target', seq_len=36)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5,
                           metrics=skm_to_fastai(f1_score), seq_len=36)
learn.fine_tune(4, 1e-2)

Then I load a dataset (the same one in this example) and make predictions on it:

dl_test = learn.dls.test_dl(df, with_labels=True)
preds = learn.get_preds(dl=dl_test, with_decoded=True)
df['preds'] = preds[2] # I assume that `preds[1]` are the targets and `preds[2]` are the predicted labels

The resulting score is close to random:

f1_score(df['target'], df['preds'])

If I use learn.predict instead, the results are good, but it is very slow:

df['preds'] = df['text'].map(lambda x: learn.predict(x)[0]) 

Correct results are also given by:

f1_score(preds[1], preds[2])

For information, here is the format of preds:

(tensor([[0.8350, 0.1650],
         [0.8271, 0.1729],
         [0.7271, 0.2729],
         [0.7816, 0.2184],
         [0.7872, 0.2128],
         [0.7755, 0.2245]]),
 TensorCategory([0, 0, 0,  ..., 0, 0, 0]),
 tensor([0, 0, 0,  ..., 0, 0, 0]))

Should that not be preds?

Yes, it is preds. I corrected the post, thanks. But the question has not changed.

I have included only the problematic part. The full notebook is here -

You need to tokenize your dataframe with tokenize_df, I’d say.

No change. Both the predictions and the labels in the output of learn.get_preds(...) are shuffled.

Yes, the dataloader returns the results sorted by text length, to be memory-efficient.

Thank you, that makes sense. Can we restore the original order somehow? I’ll try sorting the dataframe.
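
Restoring the order amounts to inverting the permutation that the length-based sort applied. Here is a minimal pure-Python sketch with toy data; the texts, `fake_predict`, and the longest-first sort are illustrative assumptions, not fastai's internals:

```python
# Toy stand-ins: texts in their original row order.
texts = ["a b c d", "a", "a b c", "a b"]

def fake_predict(text):
    # Hypothetical model: label 1 if the text has an even number of tokens.
    return int(len(text.split()) % 2 == 0)

# Assumption: rows are processed sorted by token count, longest first.
sorted_idxs = sorted(range(len(texts)),
                     key=lambda i: len(texts[i].split()),
                     reverse=True)

# Predictions come back in the sorted order, not the original row order.
preds_sorted = [fake_predict(texts[i]) for i in sorted_idxs]

# Invert the permutation: preds_sorted[k] belongs to row sorted_idxs[k].
preds_original = [None] * len(texts)
for k, i in enumerate(sorted_idxs):
    preds_original[i] = preds_sorted[k]
```

This also shows why comparing preds[1] with preds[2] gives a sensible score: both are in the same sorted order, whereas df['target'] is still in the original row order.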

I added reorder=True to Learner.get_preds, so the original order is now restored by default.

Now it works great :)