Does get_preds randomize / shuffle the order of the predictions?

I’m having a very annoying problem, and feel like I must be missing something obvious.

I have a test set that I’d like to produce predictions for, valid_labeled.csv. It has two columns, ‘sentiment’ and ‘content’. The ‘sentiment’ column contains target labels [0 or 1], and the ‘content’ column contains the text of tweets.

All I want is to be able to produce predictions for that test dataset, in the order of that test dataset.

However, it appears to me that get_preds randomizes the order of the produced predictions!

I load in this data like so:

import pandas as pd
df_test = pd.read_csv(DATAPATH/'valid_labeled.csv')
df_test['is_valid'] = 1  # flag every row as validation data

I’ve already trained my model on another dataset, ‘train_labeled_s1.csv’, formatted the same way. I create a new df that contains the training data and the test data, and save it into a csv:

df_tmp = pd.read_csv(DATAPATH/'train_labeled_s1.csv')
df_tmp=df_tmp[df_tmp['is_valid']==0]
df_tmp=pd.concat([df_tmp,df_test])
df_tmp.to_csv(DATAPATH/'df_tmp.csv', index=False)

I now create a databunch and learner based on this data. The training data goes into learn.data.train_ds. The test data goes into learn.data.valid_ds:

tmplist=TextList.from_csv(DATAPATH, 'df_tmp.csv', cols='content', vocab=data_lm.train_ds.vocab)
data_tmp = (tmplist
                .split_from_df(col='is_valid')
                .label_from_df(cols='sentiment')
                .databunch(bs=40))
learn = text_classifier_learner(data_tmp, AWD_LSTM)

I now load in the model weights and make predictions on the validation set:

learn.load('s1_12e_unfreeze_001')
predictions_val=learn.get_preds(DatasetType.Valid)

This returns both the predicted probabilities and the associated target labels:

for i in range(5):
    print(predictions_val[0].cpu().numpy()[i],predictions_val[1].cpu().numpy()[i])

produces:

[0.738483 0.261517] 0
[0.513575 0.486425] 0
[0.264869 0.735131] 1
[0.365152 0.634848] 0
[0.055368 0.944632] 0

And I get good prediction accuracy on these:

accuracy(*predictions_val)

gives me an accuracy of 80.1%.

However, the order of the targets in the predictions does NOT match the order of the targets in the test dataset dataframe:

for i in range(5):
    print(predictions_val[1].cpu().numpy()[i], df_test['sentiment'].iloc[i])

produces:

0 0
0 1
1 1
0 1
0 1

This means that the predictions produced by get_preds do not match the order of the data that I want to predict on (i.e. the rows in valid_labeled.csv)!
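
One quick way to tell a reordering apart from genuinely wrong targets is to compare label counts: if both sequences hold the same labels in the same quantities but differ position by position, a shuffle or sort is the likely cause. The sketch below is plain Python with toy values; the helper name `same_labels_different_order` is made up for illustration and is not part of fastai. With the objects from this post, the inputs would be something like `predictions_val[1].tolist()` and `df_test['sentiment'].tolist()`.

```python
from collections import Counter

def same_labels_different_order(returned, expected):
    """True when both sequences hold the same labels but in a different order."""
    returned, expected = list(returned), list(expected)
    return Counter(returned) == Counter(expected) and returned != expected

# Toy stand-ins (hypothetical values, not the data from this thread)
returned_targets = [0, 0, 1, 0, 1]   # order in which the targets came back
dataframe_labels = [0, 1, 1, 0, 0]   # order of the rows in the dataframe

print(same_labels_different_order(returned_targets, dataframe_labels))
```

Matching counts are only consistent with a reordering; they do not prove it, so treat a True result as a hint rather than a diagnosis.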

I also tried loading the test data into the test slot instead of the validation slot, but that produces exactly the same prediction probabilities:

test_list=TextList.from_csv(DATAPATH, 'valid_labeled.csv', cols='content', vocab=data_lm.train_ds.vocab)
data_tmp = (tmplist
                .split_from_df(col='is_valid')
                .label_from_df(cols='sentiment')
                .add_test(test_list)
                .databunch(bs=40))
learn = text_classifier_learner(data_tmp, AWD_LSTM)
learn.load('s1_12e_unfreeze_001')
predictions_val=learn.get_preds(DatasetType.Valid)
predictions_test=learn.get_preds(DatasetType.Test)
# the prediction probabilities do match between the validation and test sets
for i in range(5):
    print(predictions_val[0].cpu().numpy()[i], predictions_test[0].cpu().numpy()[i])

produces:

[0.738483 0.261517] [0.738483 0.261517]
[0.513575 0.486425] [0.513575 0.486425]
[0.264869 0.735131] [0.264869 0.735131]
[0.365152 0.634848] [0.365152 0.634848]
[0.055368 0.944632] [0.055368 0.944632]

I’m really confounded. All I want is to be able to produce predictions for a test dataset, in the order of that test dataset.

Any help is greatly appreciated.

Doh! I needed to pass ordered=True to get_preds.

IMHO, “ordered=True” should be the default for get_preds, because why would you ever want the predictions on the test set in a shuffled order?
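
For anyone wondering why the order changes at all: as far as I understand it, the text classifier batches texts sorted by length to limit padding, so predictions come back in sampler order rather than file order. With the code from the first post, the fix would presumably look like learn.get_preds(DatasetType.Valid, ordered=True). The sketch below is plain Python, not fastai's actual code; the toy texts and variable names are assumptions made up to show how an inverse permutation restores the original order.

```python
# Illustrative only: a toy model of length-sorted batching, not fastai internals.
texts = ["a b c d e", "a", "a b c", "a b"]  # hypothetical rows, in file order

# A length-sorted sampler visits the longest texts first.
sampler = sorted(range(len(texts)),
                 key=lambda i: len(texts[i].split()),
                 reverse=True)

# Stand-in "predictions": just the token count, emitted in sampler order.
preds_sampler_order = [len(texts[i].split()) for i in sampler]

# Inverse permutation: for each original row, its position in sampler order.
reverse_sampler = sorted(range(len(sampler)), key=lambda k: sampler[k])

# Re-index the predictions so that row i lines up with texts[i] again.
preds_file_order = [preds_sampler_order[k] for k in reverse_sampler]

print(preds_sampler_order)
print(preds_file_order)
```

The key step is the argsort of the sampler indices: applying it to the outputs undoes whatever permutation the sampler introduced, as long as the sampler order is known or reproducible.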

That’s true, feel free to suggest a PR fixing this!

I found that the CNN learner’s get_preds doesn’t need ordered=True to return predictions in the original order. Is it correct that only the RNN and tabular learners need ordered=True? Thank you!

Just RNNs actually

Right… tabular doesn’t scramble the order.
