Different results during inference when using learn.get_preds() or learn.predict()

I am trying to do inference on a tabular dataset and I notice that I get different results when I use the test dataset and run inference as

learn.get_preds(DatasetType.Test)

or

df.apply(learn.predict, axis=1)

Here is a simple reproducible notebook gist with the needed pickle files. Check the confusion matrix in both cases.

It might be that I am making mistakes in how the library is being used; any feedback is welcome. Thanks!

This comes from the fact that the dataframe you used for tests has been modified by the fastai library (the processors apply their transforms in place). I’ll look at why to try to fix it, but in the meantime, reload your dataframe to get the same results with predict.
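Until then, a rough sketch of that workaround (assuming the fastai imports and the learn object from your notebook; the CSV filename is just a placeholder):

import pandas as pd

# batch inference on the test set attached to the DataBunch
preds, _ = learn.get_preds(DatasetType.Test)

# re-read the dataframe so predict sees a fresh copy that the
# processors have not modified in place
test_df = pd.read_csv('test.csv')  # placeholder path
row_preds = test_df.apply(learn.predict, axis=1)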

Edit: This is now fixed in master.

Thanks for the quick fix, Sylvain. I can confirm it is working fine. I have noticed a small issue with the data.classes value when using TabularDataBunch for a classification task: it picks up the categorical values as classes, which are not the unique classes of the dependent variable. It does have the correct count of classes under data.c, though. I need a small fix if I want to see the correct values, namely data.classes = data.valid_ds.y.classes. I think the reason for this behavior is this code block. Not sure what the best way to fix it is.
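For reference, the workaround as a plain snippet (assuming data is a TabularDataBunch built for a classification task):

print(data.c)        # the count of classes is correct
print(data.classes)  # but this shows categorical values, not the target classes

# point data.classes at the classes of the dependent variable
data.classes = data.valid_ds.y.classes
print(data.classes)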

With a text_classifier_learner I still run into the same issue of getting different results using learn.get_preds(DatasetType.Test, ordered=True) and df.apply(learn.predict, axis=1). Was the fix specific to tabular data?
Reloading or creating a deep copy of the dataframe with df_copy = df.copy() helps to get the same categories as get_preds(); however, the prediction probabilities differ quite a bit.
Any ideas why this is happening?

The issue was specific to tabular, or so I thought. I’d need more details and a reproducible example to fix this.

OK, forget about copying the dataframe. I was applying the predict method to the entire dataframe instead of the text column, which yielded some weird results…

However, I still get different classes and probabilities when applying predict() and get_preds() to the test dataframe. Here is a minimal example based on the IMDB text classification tutorial. It’s not supposed to classify well, just a quick and lightweight way to replicate the issue.

Am I missing something essential, or are these two methods even supposed to return the same results?

from fastai.text import *

path = untar_data(URLs.IMDB_SAMPLE)
data = pd.read_csv(path/'texts.csv')

# disjoint train / validation / test splits drawn from the sample
train_data = data.sample(150, random_state=42)
valid_data = data.drop(train_data.index).sample(50, random_state=42)
test_data = data.drop(train_data.index.append(valid_data.index)).sample(50, random_state=42)

data_lm = TextLMDataBunch.from_df(path, train_df=train_data, valid_df=valid_data)
data_clas = TextClasDataBunch.from_df(path, train_df=train_data, valid_df=valid_data, test_df=test_data, bs=4, vocab=data_lm.vocab)

learn = language_model_learner(data_lm, drop_mult=0.5, pretrained_model=URLs.WT103)
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)
learn.save_encoder('encoder')

classifier = text_classifier_learner(data_clas, drop_mult=0.2)
classifier.load_encoder('encoder')

classifier.fit_one_cycle(1, 1e-2)
classifier.freeze_to(-2)
classifier.fit_one_cycle(1, slice(5e-3/2., 5e-3))
classifier.unfreeze()
classifier.fit_one_cycle(3, slice(2e-3/100, 2e-3))
classifier.save('classifier')

# batch predictions on the test set, ordered to match the dataframe order
pred, _ = classifier.get_preds(DatasetType.Test, ordered=True)
preds_prob, preds_class = pred.max(1)  # max probability and predicted class index per row

# row-by-row predictions on the raw text column
predict_df = test_data.text.apply(classifier.predict)
predict_df_class = [x[0].obj for x in predict_df.values]           # predicted class labels
predict_df_prob = [max(x[2].tolist()) for x in predict_df.values]  # highest class probability

print(preds_class[:10])
print(predict_df_class[:10])
print(preds_prob[:10])
print(predict_df_prob[:10])

# compare a single example across both methods
sample_nr = 5
sample_text = test_data.iloc[sample_nr].text
print(classifier.predict(sample_text))
print('{}, {}'.format(preds_class[sample_nr], pred[sample_nr]))

They are supposed to return the same thing. I’ll look at this tomorrow.

Eh! It has taken me a long time to figure out why the results are different, but the answer is very simple: when you use the test set, your texts are padded to be put together in a batch, and that padding is not (yet) ignored by the model. In predict, your text is alone, so there is no padding needed.
There was also a small issue of not adding the BOS token at the beginning, but I took care of that.
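Roughly, with made-up token ids (a toy sketch, not the actual collate function; the pad index of 1 is an assumption):

text_a = [2, 45, 67, 89, 12, 34, 56, 78]   # longer review
text_b = [2, 45, 67]                        # shorter review

# get_preds: both texts go through the model in one batch, so the
# shorter one gets padded up to the length of the longer one
batch = [text_a, [1] * (len(text_a) - len(text_b)) + text_b]

# predict: the text is alone, so no padding is added at all
single = text_b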

Thanks so much for figuring this out, @sgugger!
Is it correct to conclude that so far the classifier.predict(text) results are “more reliable” because they don’t show these padding effects?

It depends on how you will use your model at inference time: will you have a lot of texts and feed them batch by batch, or will you feed them one by one? Depending on which, you’ll trust one approach over the other.
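In code, the two paths look roughly like this (reusing the classifier from the example above):

# one by one: no padding involved
pred_class, pred_idx, probs = classifier.predict("Some new review text")

# batch by batch: the texts attached as the test set are padded to the
# longest text in each batch before going through the model
preds, _ = classifier.get_preds(DatasetType.Test, ordered=True)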

@sgugger If I use a batch to predict, will it be possible to get the output for all timesteps?
Using that, I could predict the correct output.

For example,
I have two sentences:

1) You just need to complete your full profile so we can provide the best rates and terms for it

2) I don't like this.

The input to our model will be something like below.

“xxbos”, “You”, “just”, “need”, “to” , “complete” ,“your”, “full”, “profile” ,“so” , “we”, “can” ,“provide” ,“the”, “best”, “rates” , “and” , “terms”, “for” , “it”

“xxbos”, “I”, “don’t”, “like”, “this”, “xxpad”, “xxpad”, “xxpad”, “xxpad”, “xxpad”, “xxpad”, “xxpad”, “xxpad”, “xxpad”, “xxpad”, “xxpad”, “xxpad”, “xxpad”, “xxpad”

As you can see, getting the output at the last timestep for the 2nd sentence will not be correct; mostly we will get only neutral (in sentiment). If there is a way to get the output after the 5th timestep (in our example, after “this”), we will get the correct output.

Note that the padding is applied first (so it would be xxpad xxpad … xxbos I in the second example). get_preds returns the final output, so it doesn’t have anything about the timesteps anymore (it’s just the two probabilities for positive/negative in the case of sentiment analysis).
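Laid out with pad-first padding, the second example would look roughly like this (a toy sketch using token strings instead of ids):

sent_1 = ['xxbos', 'You', 'just', 'need', 'to', 'complete', 'your', 'full', 'profile',
          'so', 'we', 'can', 'provide', 'the', 'best', 'rates', 'and', 'terms', 'for', 'it']
sent_2 = ['xxbos', 'I', "don't", 'like', 'this']

# pad the shorter sentence at the front so both fit in one batch
padded_2 = ['xxpad'] * (len(sent_1) - len(sent_2)) + sent_2
print(padded_2)  # ['xxpad', 'xxpad', ..., 'xxpad', 'xxbos', 'I', "don't", 'like', 'this']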

Thank you so much, @sgugger. It works. But I am wondering how it works internally: is this because of the way we trained our model, or something else?