learn.get_preds(ds_type=DatasetType.Test) with a Tabular Learner returns predictions in a different order than the test set

mlaugharn · March 14, 2019, 6:35pm

It would be nice to be able to get predictions in the same order as they are in my test dataframe.

Code for reference:

import pandas as pd
from fastai import *
from fastai.tabular import *

path = './'
train_df = pd.read_csv('./train.csv')
split = 40000
valid_idx = range(len(train_df)-split, len(train_df))
test_df = pd.read_csv('./test.csv')
dep_var = 'target'

data = TabularDataBunch.from_df(path, train_df, dep_var, valid_idx=valid_idx, test_df=test_df)
learn = tabular_learner(data, layers=[200,20], metrics=accuracy)

learn.data.show_batch()
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(10, 1e-2)

preds, y = learn.get_preds(DatasetType.Test) # <-- I think these are in a different order than test_0, test_1, etc.

As a workaround, right now I am just iterating through the test dataframe and predicting one row at a time, but this is slow and seems wrong.

sgugger · March 15, 2019, 2:20am

That is weird, they normally are in the same order. Can you check that data.show_batch(ds_type=DatasetType.Test) returns the same things as your first rows?

Note that test sets are unlabeled, so if you say this because your ys are 0, this isn’t a good check.
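
One order check that does not rely on the placeholder ys is to compare the batched predictions with single-row predictions for the first few rows. The sketch below is only an illustration: `predict_row` is a hypothetical stand-in for a model call and the rows are made up. With a real learner, the batched values would presumably come from learn.get_preds(DatasetType.Test)[0] and the per-row values from learn.predict(row).

```python
import math

def predict_row(row):
    """Hypothetical stand-in for a single-row model call; returns class probabilities."""
    score = sum(row)
    p1 = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p1, p1]

def same_order(batch_preds, rows, n_check=5, tol=1e-4):
    """Return True if the first n_check batched predictions match per-row predictions."""
    for batch_probs, row in zip(batch_preds[:n_check], rows[:n_check]):
        row_probs = predict_row(row)
        if any(abs(a - b) > tol for a, b in zip(batch_probs, row_probs)):
            return False
    return True

# Made-up test rows, in the order they appear in the test dataframe.
test_rows = [[-1, -1, 1], [1, -1, 1], [-1, 1, 1], [1, 1, 1]]

# Batched predictions in the same order as the rows.
ordered_preds = [predict_row(r) for r in test_rows]
# The same predictions reversed, simulating an ordering problem.
shuffled_preds = list(reversed(ordered_preds))

print(same_order(ordered_preds, test_rows))   # True
print(same_order(shuffled_preds, test_rows))  # False
```

Checking only a handful of rows keeps the slow per-row path cheap while still catching a reordering.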

muellerzr · May 14, 2019, 4:52am

@sgugger I am noticing this as well:

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
       .split_by_rand_pct()
       .label_from_df(cols=dep_var)
       .add_test(TabularList.from_df(test, path=path, cat_names=cat_vars, cont_names = cont_vars, procs=procs))
       .databunch())

Here test is my test dataframe.

show_batch:

having_IP_Address@{-1,1}	URL_Length@{1,0,-1}	Shortining_Service@{1,-1}	having_At_Symbol@{1,-1}	double_slash_redirecting@{-1,1}	Prefix_Suffix@{-1,1}	having_Sub_Domain@{-1,0,1}	SSLfinal_State@{-1,1,0}	Domain_registeration_length@{-1,1}	Favicon@{1,-1}	port@{1,-1}	HTTPS_token@{-1,1}	Request_URL@{1,-1}	URL_of_Anchor@{-1,0,1}	Links_in_tags@{1,-1,0}	SFH@{-1,1,0}	Submitting_to_email@{-1,1}	Abnormal_URL@{-1,1}	Redirect@{0,1}	on_mouseover@{1,-1}	RightClick@{1,-1}	popUpWidnow@{1,-1}	Iframe@{1,-1}	age_of_domain@{-1,1}	DNSRecord@{-1,1}	web_traffic@{-1,0,1}	Page_Rank@{-1,1}	Google_Index@{1,-1}	Links_pointing_to_page@{1,0,-1}	Statistical_report@{-1,1}	target
-1	-1	1	1	1	1	1	1	-1	1	1	1	1	1	1	-1	1	1	0	1	1	1	1	-1	1	1	-1	1	1	1	-1
-1	-1	1	1	1	-1	1	-1	-1	1	1	1	1	0	-1	-1	1	1	1	1	1	1	1	-1	1	0	1	1	0	1	-1
1	-1	1	1	1	-1	1	1	-1	1	1	1	1	1	1	-1	1	1	0	1	1	1	1	-1	1	1	1	1	0	1	-1
-1	-1	1	1	1	1	1	1	-1	1	1	1	1	-1	-1	-1	1	1	0	1	1	1	1	-1	-1	1	-1	1	1	1	-1
1	-1	1	-1	1	-1	1	0	-1	1	1	1	1	1	-1	-1	1	1	0	1	1	1	1	-1	1	1	1	1	0	1	-1

And this is from the dataframe

	having_IP_Address@{-1,1}	URL_Length@{1,0,-1}	Shortining_Service@{1,-1}	having_At_Symbol@{1,-1}	double_slash_redirecting@{-1,1}	Prefix_Suffix@{-1,1}	having_Sub_Domain@{-1,0,1}	SSLfinal_State@{-1,1,0}	Domain_registeration_length@{-1,1}	Favicon@{1,-1}	…	popUpWidnow@{1,-1}	Iframe@{1,-1}	age_of_domain@{-1,1}	DNSRecord@{-1,1}	web_traffic@{-1,0,1}	Page_Rank@{-1,1}	Google_Index@{1,-1}	Links_pointing_to_page@{1,0,-1}	Statistical_report@{-1,1}	Result@{-1,1}
5916	-1	-1	1	1	1	-1	1	-1	-1	1	…	1	1	-1	1	0	1	1	0	1	-1
6293	1	-1	1	1	1	-1	1	1	-1	1	…	1	1	-1	1	1	1	1	0	1	1

They do not show the same thing here (result is my target in this scenario).

sgugger · May 14, 2019, 12:58pm

By default show_batch shows you samples from the training set, which is shuffled.

muellerzr · May 14, 2019, 1:04pm

Interesting, I see that now. However, there is another issue: when I pass a databunch to get_preds, the predictions do seem out of order, or the accuracy drops dramatically. Comparing learn.predict() with learn.get_preds(), predict gives ~97% accuracy, whereas get_preds compared against the actual ground truth only gives me ~50%. Is order being lost in get_preds?

sgugger · May 14, 2019, 1:05pm

I’d need to see more of your code to understand where the problem is.

muellerzr · May 14, 2019, 1:08pm

Ah, I think I see my issue now. The element at position 1 of get_preds (get_preds()[1]) returns the index of the category in the class list, not the category itself. Apologies! Has there been any thought of including the predicted category for situations in tabular regression?

sgugger · May 14, 2019, 1:17pm

No, get_preds always returns predictions/ground truth in a non-processed way (so you get the indices, not the classes, yes).
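
To illustrate the point about indices versus classes, here is a minimal sketch in plain Python. The class list, probabilities, and ground truth are all made up; with a fastai v1 learner the class list is usually available as learn.data.classes, but treat that as an assumption to verify for your version. It shows how comparing raw indices against the original labels can give a misleadingly low accuracy, and how mapping indices back to classes fixes it.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])

def decode(pred_probs, classes):
    """Map each row of class probabilities to its class label."""
    return [classes[argmax(row)] for row in pred_probs]

def accuracy(pred, target):
    """Fraction of positions where pred and target agree."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)

# Assumed class list; in this thread the target takes the values -1 and 1.
classes = [-1, 1]

# Made-up probabilities, one row per test example (stand-in for get_preds()[0]).
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]]
# Made-up ground truth in the original label space.
truth = [-1, 1, 1, -1]

indices = [argmax(row) for row in pred_probs]  # positions in the class list
labels = decode(pred_probs, classes)           # actual class values

print(accuracy(indices, truth))  # 0.5: indices 0/1 compared against labels -1/1
print(accuracy(labels, truth))   # 1.0: like compared with like
```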

muellerzr · May 14, 2019, 1:29pm

Ok. Thank you very much sgugger!

polohot · June 27, 2019, 8:08pm

Has this been resolved? I have this problem as well.
At the moment I am also iterating through every row.

Pawan28a · February 8, 2020, 9:29am

Hi,

I’m facing the same issue. Here’s the code I’ve used:
test = ImageList.from_folder(TEST)
learn = load_learner(modelDir, test = test)
predictions,_ = learn.get_preds(ds_type=DatasetType.Test)
labels = np.argmax(predictions, 1)

loftiskg · February 8, 2020, 11:02pm

Yeah, is there any follow-up on this issue? I am experiencing the same thing. When I use get_preds(DatasetType.Test) it returns all the rows out of order. Currently, I am iterating through each row in my test set and using the predict method, but this is painfully slow. Would love to know if anyone has solved this problem! Thanks!
