Different results for .get_preds() and .predict()

(Josh Varty) #1

I have trained and exported a model that works with tabular data, and I am now following the Inference Learner tutorial while trying to generate results on a test set.

I load my test dataframe with:

test_df = pd.read_csv("data/testSet.csv", dtype=dtypes, nrows=5)

I use the same cat_names, cont_names and procs when loading my test_df dataframe into a TabularList:

data = TabularList.from_df(test_df, cat_names=cat_names, cont_names=cont_names, procs=procs)

I load my learner and mark my data as test data:

learner = load_learner('data/', test=data)

I generate predictions with .get_preds() and print them.

preds,y = learner.get_preds(ds_type=DatasetType.Test)
print(preds)

tensor([[0.8476, 0.1524],
        [0.7529, 0.2471],
        [0.8152, 0.1848],
        [0.8072, 0.1928],
        [0.7275, 0.2725]])

To double-check, I use the same learner to predict directly on the TabularList:

learner.predict(data[1])

(Category 0, tensor(0), tensor([0.9867, 0.0133]))

Strangely this doesn’t give the same result as above. I also tried against the dataframe (test_df) itself.

learner.predict(test_df.iloc[0])

(Category 0, tensor(0), tensor([0.5344, 0.4656]))

I wouldn’t expect the last one to work (the preprocessing steps haven’t been applied to the dataframe) but I don’t understand why the other two are giving me different answers.

Have I misunderstood something? Are there any additional steps (put something in .eval() mode or something) when calling either .predict() or .get_preds()? I would expect them to give the same results.

2 Likes

(Hugo Pedroso de Lima) #2

I am finding the same issue. Getting different results with .get_preds() and .predict(). Have you figured out why this is happening?

1 Like

#3

This has been discussed in a topic on the fastai users forum. First you need to pass ordered=True in your call to get_preds to get the predictions in the same order as your dataframe (otherwise they are sorted by text length).
Then you will still see a small difference, because in predict the text is passed without padding, whereas in get_preds padding is added so that several texts can form one batch.
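As an aside, the un-sorting that ordered=True performs can be sketched in plain Python: given the permutation the length-based sampler used, invert it to map batch-order predictions back to dataframe order. (reorder_preds and the example permutation are hypothetical illustrations of the idea, not fastai API.)

```python
def reorder_preds(preds, sampler_idx):
    """Invert the sampler's permutation: sampler_idx[pos] is the original
    row that ended up at batch position pos, so scatter each prediction
    back to its original slot."""
    out = [None] * len(preds)
    for pos, orig in enumerate(sampler_idx):
        out[orig] = preds[pos]
    return out

# Rows were visited length-sorted as [2, 0, 1]; un-sort the predictions:
print(reorder_preds(["p2", "p0", "p1"], [2, 0, 1]))  # ['p0', 'p1', 'p2']
```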

6 Likes

(Josh Varty) #4

I had issues that went away when I stopped creating my dataframe with dtypes. I’m not sure which dtype in particular was causing the issue for me.

0 Likes

(Hugo Pedroso de Lima) #5

Ok that makes sense now. Thank you for the explanation

0 Likes

(Sapir Gershov) #6

How about segmentation predictions?
I was working with the CamVid notebook and the predictions didn’t match the original files. I was sure the predictions were in the same order as data.train_ds.y.items, but it seems I was mistaken.
Any advice?

0 Likes

#7

No, segmentation predictions should be in the same order as the data for the validation or test set (you mentioned the training set, but the training set is always shuffled).

0 Likes

(Sapir Gershov) #8

Thank you for your quick response.
Is there a way for me to save the predicted segmentation with its matching file name?

0 Likes

#9

Filenames will be in data.valid_ds.x.items (replace valid with test if necessary), and the predicted segmentations are the results of get_preds.
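A minimal sketch of that pairing (the item paths, mask placeholders, and the "_pred" naming scheme here are made up for illustration; zip assumes get_preds returned one result per item, in item order):

```python
from pathlib import Path

# Hypothetical item list and predictions, in matching order.
items = ["images/frame_001.png", "images/frame_002.png"]
preds = ["mask_001", "mask_002"]

def output_name(src):
    # Derive a save path from the source image name (naming scheme is
    # just an example; adjust to your layout).
    p = Path(src)
    return p.with_name(p.stem + "_pred" + p.suffix)

paired = {str(output_name(f)): m for f, m in zip(items, preds)}
print(paired)
```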

0 Likes

(Aman Arora) #10

I am getting different results every time I run fastai tabular. Is that normal? I literally just re-ran the Jupyter notebook and got a different within_10 % error on my regression problem.
Sometimes the tabular network works really well and sometimes it just goes crazy.

How can I overcome this problem? Is there anything in particular that could be causing it?

0 Likes

(Aman Arora) #11

Also, do neural networks for tabular data overfit very easily?

0 Likes

(Narasimha) #12

Have you found the reason for this issue? Could it be due to the dropout layers in the network, if you have them?
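For what it’s worth, dropout alone shouldn’t change inference results (it is disabled in eval mode), but it does make training runs nondeterministic unless seeds are fixed. A toy sketch of inverted dropout, with made-up activation values, showing why two unseeded runs can differ:

```python
import random

def dropout(xs, p, rng):
    # Inverted dropout: zero each value with probability p and scale the
    # survivors by 1/(1-p) so the expected activation is unchanged.
    return [0.0 if rng.random() < p else x / (1 - p) for x in xs]

acts = [1.0] * 6
run_a = dropout(acts, 0.5, random.Random())       # unseeded: varies per run
seeded_1 = dropout(acts, 0.5, random.Random(42))  # seeded: reproducible
seeded_2 = dropout(acts, 0.5, random.Random(42))
assert seeded_1 == seeded_2
```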

1 Like

(Khunakorn Luyaphan) #13

I’ve been trying to find a way to predict on a new test set for about two weeks now, by assigning test data to the model:

src2 = TabularList.from_df(X_tst, path=PATH, cat_names=None, cont_names=cont_names).split_none().label_empty()
data2 = src2.databunch(bs=1)
learn.data.test_dl = data2.train_dl
a,b = learn.get_preds(ds_type=DatasetType.Test)

I still get different results compared to iterating one by one with .predict(); it’s as if my predictions got shuffled somehow.
Any help is much appreciated

0 Likes

#14

Assuming you need your predictions sorted by file name of the test set, I found this useful:

preds, _ = learner.get_preds(ds_type=DatasetType.Test)
pred_ix = pd.Series(data.test_ds.x.items).sort_values().index
preds = preds[pred_ix]

0 Likes

(Laura) #15

Hey,

I have the same issue with the vision module even though it was stated that the results of get_preds() are always ordered.

I was trying to run the get_preds() on my validation set to calculate another metric on it.

After running

preds = learn.get_preds(); preds[0][0], preds[1][0]

I get: (tensor([-0.5458, -1.3085, -0.2104, 0.6581, -0.6128]), tensor(0))

However:

learn.predict(data.valid_ds.x[0]) returns
(Category 3, tensor(3), tensor([-0.5458, -1.3085, -0.2104, 0.6581, -0.6128]))

As you can see the prediction values are the same but the categories are off.
Can someone please advise what I should be doing since the vision get_preds() doesn’t have the ordered flag?

FYI, I am running this in a Kaggle kernel with fastai version 1.0.55. I am also using a custom loss function (in case that has some sort of effect).

Thanks!

0 Likes

(Zachary Mueller) #16

You need to take the argmax() of the preds to get the predicted class. You can see that category 3 has the highest value, so it is the predicted category.
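A quick sketch with the numbers from the post above (pure Python, just to make the mapping from raw scores to Category explicit):

```python
def argmax(xs):
    # Index of the largest value, i.e. the predicted category.
    return max(range(len(xs)), key=lambda i: xs[i])

scores = [-0.5458, -1.3085, -0.2104, 0.6581, -0.6128]
print(argmax(scores))  # 3, matching Category 3 from learn.predict
```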

0 Likes

(Laura) #17

Hi @muellerzr,

Thanks. That is what I am currently doing. But I just don’t know what get_preds() is actually returning to me, as not all the predicted categories are 0…

0 Likes

(Zachary Mueller) #18

get_preds() returns the raw probabilities for every option and a label (ground truth). If this is on a test set then all the labels are 0 by default and you ONLY want to look at the predictions.

In general you want to look only at those probabilities too.

1 Like

#19

I am having a similar problem to the one above, where .get_preds() does not match .predict(). I am trying to run this on the (unlabelled) test set.

I am using tabular data on the Kaggle Titanic dataset, so the results need to match the PassengerId. I found a really clunky way to do it with .predict(), but I’d like to be able to do it more efficiently with argmax() and .get_preds().

I tried passing the argument ordered=True as suggested, but .get_preds() does not recognize it as an argument and it is not in the docs either. How can I get my get_preds() output in the same order as my dataframe for the test set?

0 Likes

#20

I think a similar issue happens with TextClasDataBunch as well:

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
interp_T = TextClassificationInterpretation.from_learner(learn)
interp_T.show_top_losses(k=5)

The predictions in the output are very different from learn.predict(df_test.iloc[r].text). Since the documents are grouped and sorted according to the length of the text to reduce the amount of padding, setting ordered=True probably won’t work for text.
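To illustrate the padding point from earlier in the thread: in a batch, shorter texts are padded up to the longest one, whereas predict sees a single unpadded sequence, which can shift the outputs slightly. A toy sketch (the token lists and the pad id of 1 are made-up stand-ins, not the actual fastai internals):

```python
def pad_batch(seqs, pad_id=1):
    # Left-pad every sequence to the length of the longest one so they
    # can be stacked into a single rectangular batch.
    longest = max(len(s) for s in seqs)
    return [[pad_id] * (longest - len(s)) + s for s in seqs]

batch = pad_batch([[5, 6], [7, 8, 9, 10]])
print(batch)  # [[1, 1, 5, 6], [7, 8, 9, 10]]
```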

0 Likes