Learn.get_preds(ds_type = DatasetType.Test) with a Tabular Learner returns predictions in different order than test set data order

It would be nice to be able to get predictions in the same order as they are in my test dataframe.

Code for reference:

import pandas as pd
from fastai import *
from fastai.tabular import *

path = './'
train_df = pd.read_csv('./train.csv')
split = 40000
valid_idx = range(len(train_df)-split, len(train_df))
test_df = pd.read_csv('./test.csv')
dep_var = 'target'

data = TabularDataBunch.from_df(path, train_df, dep_var, valid_idx=valid_idx, test_df=test_df)
learn = tabular_learner(data, layers=[200,20], metrics=accuracy)

learn.data.show_batch()
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(10, 1e-2)

preds, y = learn.get_preds(DatasetType.Test) # <-- I think these are in a different order than test_0, test_1, etc.

As a workaround, I am currently iterating through the test dataframe and predicting one row at a time, but this is slow and seems wrong.

That is weird, they normally are in the same order. Can you check that data.show_batch(ds_type=DatasetType.Test) returns the same things as your first rows?

Note that test sets are unlabeled, so if you say this because your ys are 0, this isn’t a good check.
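A more direct check, sketched here under stated assumptions, is to predict a handful of rows one at a time and compare them with the rows at the same positions in the batch output. In the sketch, `predict_one` is a hypothetical stand-in for calling `learn.predict` on a single row, and `batch_preds` for the first element returned by `learn.get_preds(DatasetType.Test)`. The toy model exists only so the snippet runs on its own.

```python
def first_mismatch(rows, predict_one, batch_preds, n_check=5, tol=1e-6):
    """Return the index of the first checked row whose single-row prediction
    differs from the batch prediction at the same position, or -1 if the
    first n_check rows all agree."""
    for i in range(min(n_check, len(rows))):
        single = predict_one(rows[i])
        if any(abs(a - b) > tol for a, b in zip(single, batch_preds[i])):
            return i
    return -1


# Hypothetical stand-in for a trained model: maps a row to two class probabilities.
def toy_predict_one(row):
    p = min(max(sum(row) / 10.0, 0.0), 1.0)
    return [1.0 - p, p]


def toy_predict_batch(rows):
    return [toy_predict_one(r) for r in rows]


test_rows = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]]
in_order = toy_predict_batch(test_rows)   # batch output aligned with test_rows
shuffled = list(reversed(in_order))       # simulated out-of-order batch output

print(first_mismatch(test_rows, toy_predict_one, in_order))   # -1: order agrees
print(first_mismatch(test_rows, toy_predict_one, shuffled))   # 0: first row disagrees
```

Checking only a few rows keeps this fast, and it compares probabilities rather than labels, so it does not depend on the test set having targets.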

@sgugger I am noticing this as well:

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
       .split_by_rand_pct()
       .label_from_df(cols=dep_var)
       .add_test(TabularList.from_df(test, path=path, cat_names=cat_vars, cont_names = cont_vars, procs=procs))
       .databunch())

Test is my test dataframe:

show_batch:

having_IP_Address@{-1,1} URL_Length@{1,0,-1} Shortining_Service@{1,-1} having_At_Symbol@{1,-1} double_slash_redirecting@{-1,1} Prefix_Suffix@{-1,1} having_Sub_Domain@{-1,0,1} SSLfinal_State@{-1,1,0} Domain_registeration_length@{-1,1} Favicon@{1,-1} port@{1,-1} HTTPS_token@{-1,1} Request_URL@{1,-1} URL_of_Anchor@{-1,0,1} Links_in_tags@{1,-1,0} SFH@{-1,1,0} Submitting_to_email@{-1,1} Abnormal_URL@{-1,1} Redirect@{0,1} on_mouseover@{1,-1} RightClick@{1,-1} popUpWidnow@{1,-1} Iframe@{1,-1} age_of_domain@{-1,1} DNSRecord@{-1,1} web_traffic@{-1,0,1} Page_Rank@{-1,1} Google_Index@{1,-1} Links_pointing_to_page@{1,0,-1} Statistical_report@{-1,1} target
-1 -1 1 1 1 1 1 1 -1 1 1 1 1 1 1 -1 1 1 0 1 1 1 1 -1 1 1 -1 1 1 1 -1
-1 -1 1 1 1 -1 1 -1 -1 1 1 1 1 0 -1 -1 1 1 1 1 1 1 1 -1 1 0 1 1 0 1 -1
1 -1 1 1 1 -1 1 1 -1 1 1 1 1 1 1 -1 1 1 0 1 1 1 1 -1 1 1 1 1 0 1 -1
-1 -1 1 1 1 1 1 1 -1 1 1 1 1 -1 -1 -1 1 1 0 1 1 1 1 -1 -1 1 -1 1 1 1 -1
1 -1 1 -1 1 -1 1 0 -1 1 1 1 1 1 -1 -1 1 1 0 1 1 1 1 -1 1 1 1 1 0 1 -1

And this is from the dataframe

having_IP_Address@{-1,1} URL_Length@{1,0,-1} Shortining_Service@{1,-1} having_At_Symbol@{1,-1} double_slash_redirecting@{-1,1} Prefix_Suffix@{-1,1} having_Sub_Domain@{-1,0,1} SSLfinal_State@{-1,1,0} Domain_registeration_length@{-1,1} Favicon@{1,-1} popUpWidnow@{1,-1} Iframe@{1,-1} age_of_domain@{-1,1} DNSRecord@{-1,1} web_traffic@{-1,0,1} Page_Rank@{-1,1} Google_Index@{1,-1} Links_pointing_to_page@{1,0,-1} Statistical_report@{-1,1} Result@{-1,1}
5916 -1 -1 1 1 1 -1 1 -1 -1 1 1 1 -1 1 0 1 1 0 1 -1
6293 1 -1 1 1 1 -1 1 1 -1 1 1 1 -1 1 1 1 1 0 1 1

They do not show the same thing here (result is my target in this scenario).

By default show_batch shows you samples from the training set, which is shuffled.

Interesting, I see that now. However, there is another issue: when I pass a databunch to get_preds, the predictions seem to be out of order, or the accuracy drops dramatically. With learn.predict() I get ~97% accuracy, whereas comparing the output of learn.get_preds() with the ground truth gives only ~50%. Is order being lost in get_preds?

I’d need to see more of your code to understand where the problem is.

Ah, I think I see my issue now. The element at position 1 of the get_preds output (get_preds()[1]) holds the LOCATION of each category in the class list, not the category itself. Apologies! Has there been any thought about also returning the predicted category, for situations in tabular regression?

No, get_preds always returns predictions and ground truth in unprocessed form (so yes, you get the indices, not the classes).
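To turn those indices into class names, take the argmax of each probability row and look it up in the class list. This is a minimal sketch, assuming the class list is available in the way `learn.data.classes` exposes it. The probabilities and classes below are made-up stand-ins, and plain Python lists are used in place of tensors so the snippet runs on its own.

```python
def indices_to_classes(probs, classes):
    """Map each row of class probabilities to the class with the highest score."""
    labels = []
    for row in probs:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over the row
        labels.append(classes[best])
    return labels


# Stand-in values (assumptions): in practice probs would come from get_preds
# and classes from the learner's data.
classes = [-1, 1]
probs = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]

print(indices_to_classes(probs, classes))  # [-1, 1, 1]
```

Comparing the raw indices (0, 1) against labels such as -1 and 1 would undercount matches, which is the kind of mismatch described above.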

Ok. Thank you very much sgugger!

Has this been resolved? I have this problem as well.
At the moment I am also iterating through every row.

Hi,

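I’m facing the same issue. Here’s the code I’ve used:
import numpy as np
from fastai.vision import *  # fastai v1 star import (added; the original snippet omitted its imports)
# TEST (test image folder) and modelDir (exported learner folder) are defined elsewhere
test = ImageList.from_folder(TEST)
learn = load_learner(modelDir, test=test)
predictions, _ = learn.get_preds(ds_type=DatasetType.Test)
labels = np.argmax(predictions, 1)  # index of the highest-scoring class per item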
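One thing that may be worth ruling out (an assumption, not a confirmed diagnosis for this thread): the items in the test set may not be stored in sorted filename order, so predictions that line up with the dataset can still look shuffled next to a sorted file listing. A sketch of pairing each prediction with the dataset's own item path, rather than with a separately sorted list, follows. The `items` list is a hypothetical stand-in for something like `learn.data.test_ds.x.items`, and the labels are made up.

```python
from pathlib import Path


def pair_with_items(items, labels):
    """Pair each predicted label with the file name at the same position."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    return {Path(p).name: int(label) for p, label in zip(items, labels)}


# Stand-in values (assumptions): dataset order is deliberately not sorted.
items = ["test/img_10.png", "test/img_2.png", "test/img_1.png"]
labels = [2, 0, 1]

by_name = pair_with_items(items, labels)
for name in sorted(by_name):
    print(name, by_name[name])
```

Keying the results by file name makes the output independent of whatever order the dataset happened to load the files in.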

Yeah, is there any follow-up on this issue? I am experiencing the same thing. When I use get_preds(DatasetType.Test), it returns all the rows out of order. Currently, I am iterating through each row in my test set and using the predict method, but this is painfully slow. Would love to know if anyone has solved this problem! Thanks!
