learn.get_preds(ds_type=DatasetType.Test) with a tabular learner returns predictions in a different order than the test set

(Marc Laugharn) #1

It would be nice to be able to get predictions in the same order as they are in my test dataframe.

Code for reference:

import pandas as pd
from fastai import *
from fastai.tabular import *

path = './'
train_df = pd.read_csv('./train.csv')
split = 40000
valid_idx = range(len(train_df)-split, len(train_df))
test_df = pd.read_csv('./test.csv')
dep_var = 'target'

data = TabularDataBunch.from_df(path, train_df, dep_var, valid_idx=valid_idx, test_df=test_df)
learn = tabular_learner(data, layers=[200,20], metrics=accuracy)

learn.data.show_batch()
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(10, 1e-2)

preds, y = learn.get_preds(DatasetType.Test) # <-- I think these are in a different order than test_0, test_1, etc.

As a workaround, right now I am iterating through the test dataframe and predicting each row one at a time, but this is slow and seems wrong.

#2

That is weird, they normally are in the same order. Can you check that data.show_batch(ds_type=DatasetType.Test) returns the same things as your first rows?

Note that test sets are unlabeled, so if you say this because your ys are 0, this isn’t a good check.
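
A check that does not depend on the labels is to compare the batch output with predictions made one row at a time for the first few rows. The following is only a minimal sketch: `first_order_mismatch` is a hypothetical helper, the numbers are toy stand-ins, and the fastai v1 calls in the comments are assumptions about how the two inputs could be produced.

```python
def first_order_mismatch(batch_probs, row_probs, tol=1e-4):
    """Return the index of the first row where the two lists disagree, or None."""
    for i, (b, r) in enumerate(zip(batch_probs, row_probs)):
        if len(b) != len(r) or any(abs(x - y) > tol for x, y in zip(b, r)):
            return i
    return None

# Toy stand-ins (assumptions, not output from a real model).
# With fastai v1 the inputs might be built like this:
#   batch_probs = learn.get_preds(DatasetType.Test)[0][:n].tolist()
#   row_probs = [learn.predict(test_df.iloc[i])[2].tolist() for i in range(n)]
batch_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
in_order = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
shuffled = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]

print(first_order_mismatch(batch_probs, in_order))
print(first_order_mismatch(batch_probs, shuffled))
```

If the helper returns None for the first handful of rows, the batch predictions are most likely in dataframe order and the discrepancy lies elsewhere.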

(Zachary Mueller) #3

@sgugger I am noticing this as well:

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
       .split_by_rand_pct()
       .label_from_df(cols=dep_var)
       .add_test(TabularList.from_df(test, path=path, cat_names=cat_vars, cont_names = cont_vars, procs=procs))
       .databunch())

Here, test is my test dataframe.

show_batch:

having_IP_Address@{-1,1} URL_Length@{1,0,-1} Shortining_Service@{1,-1} having_At_Symbol@{1,-1} double_slash_redirecting@{-1,1} Prefix_Suffix@{-1,1} having_Sub_Domain@{-1,0,1} SSLfinal_State@{-1,1,0} Domain_registeration_length@{-1,1} Favicon@{1,-1} port@{1,-1} HTTPS_token@{-1,1} Request_URL@{1,-1} URL_of_Anchor@{-1,0,1} Links_in_tags@{1,-1,0} SFH@{-1,1,0} Submitting_to_email@{-1,1} Abnormal_URL@{-1,1} Redirect@{0,1} on_mouseover@{1,-1} RightClick@{1,-1} popUpWidnow@{1,-1} Iframe@{1,-1} age_of_domain@{-1,1} DNSRecord@{-1,1} web_traffic@{-1,0,1} Page_Rank@{-1,1} Google_Index@{1,-1} Links_pointing_to_page@{1,0,-1} Statistical_report@{-1,1} target
-1 -1 1 1 1 1 1 1 -1 1 1 1 1 1 1 -1 1 1 0 1 1 1 1 -1 1 1 -1 1 1 1 -1
-1 -1 1 1 1 -1 1 -1 -1 1 1 1 1 0 -1 -1 1 1 1 1 1 1 1 -1 1 0 1 1 0 1 -1
1 -1 1 1 1 -1 1 1 -1 1 1 1 1 1 1 -1 1 1 0 1 1 1 1 -1 1 1 1 1 0 1 -1
-1 -1 1 1 1 1 1 1 -1 1 1 1 1 -1 -1 -1 1 1 0 1 1 1 1 -1 -1 1 -1 1 1 1 -1
1 -1 1 -1 1 -1 1 0 -1 1 1 1 1 1 -1 -1 1 1 0 1 1 1 1 -1 1 1 1 1 0 1 -1

And this is from the dataframe:

having_IP_Address@{-1,1} URL_Length@{1,0,-1} Shortining_Service@{1,-1} having_At_Symbol@{1,-1} double_slash_redirecting@{-1,1} Prefix_Suffix@{-1,1} having_Sub_Domain@{-1,0,1} SSLfinal_State@{-1,1,0} Domain_registeration_length@{-1,1} Favicon@{1,-1} popUpWidnow@{1,-1} Iframe@{1,-1} age_of_domain@{-1,1} DNSRecord@{-1,1} web_traffic@{-1,0,1} Page_Rank@{-1,1} Google_Index@{1,-1} Links_pointing_to_page@{1,0,-1} Statistical_report@{-1,1} Result@{-1,1}
5916 -1 -1 1 1 1 -1 1 -1 -1 1 1 1 -1 1 0 1 1 0 1 -1
6293 1 -1 1 1 1 -1 1 1 -1 1 1 1 -1 1 1 1 1 0 1 1

They do not show the same thing here (Result is my target column in this scenario).

#4

By default show_batch shows you samples from the training set, which is shuffled.

(Zachary Mueller) #5

Interesting, I see that now. However, there is another issue: when I pass in a databunch to get_preds, the predictions seem to be out of order, or the accuracy drops dramatically. With learn.predict() I get ~97% accuracy, whereas comparing the output of learn.get_preds() with the actual ground truth gives me only ~50%. Is order being lost in get_preds?

#6

I’d need to see more of your code to understand where the problem is.

(Zachary Mueller) #7

Ah, I think I see my issue now. The element at position 1 of get_preds (get_preds()[1]) is the index of the category in the class list, not the category itself. Apologies! Has there been any thought about including the predicted category for situations in tabular regression?

#8

No, get_preds always returns predictions/ground truth in a non-processed way (so yes, you get the indices, not the classes).
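
Mapping those indices back to class labels can be done by hand. This is a minimal sketch rather than a fastai API: `decode_predictions` is a hypothetical helper, the probabilities are toy values, and the fastai v1 calls in the comments (including `learn.data.classes` as the source of the class list) are assumptions.

```python
def decode_predictions(probs, classes):
    """Return the class label with the highest probability for each row."""
    labels = []
    for row in probs:
        # Index of the largest probability in this row (argmax).
        best_idx = max(range(len(row)), key=lambda i: row[i])
        labels.append(classes[best_idx])
    return labels

# Toy stand-ins (assumptions, not output from the thread's model).
# With fastai v1 these might come from something like:
#   probs = learn.get_preds(DatasetType.Test)[0].tolist()
#   classes = learn.data.classes
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
classes = [-1, 1]

labels = decode_predictions(probs, classes)
print(labels)  # [-1, 1, -1]
```

Comparing these decoded labels, rather than the raw indices, against the ground-truth column avoids the kind of accuracy drop described in post #5.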

(Zachary Mueller) #9

Ok. Thank you very much sgugger!

(Khunakorn Luyaphan) #10

Has this been resolved? I have this problem as well.
At the moment I am also iterating through every row.
