Different results for .get_preds() and .predict()

get_preds doesn’t have ordered as an input parameter.

I’m getting rather large differences between predict and get_preds (ordered).
The differences are much larger than I’d find reasonable: in one instance, 88.3% vs 85.5% accuracy, leading to 71% F1 vs 77% F1. Looking at masked_concat_pool, I don’t see how differences of this size can be explained by padding.

Is my understanding correct that:

  1. The forward model uses right padding.
  2. Thus, the LSTM outputs at non-masked positions should be identical with and without padding present.
  3. masked_concat_pool masks the padded indices anyway for the avg and max pooling, and uses the last non-masked output as the third component of the concatenation.

None of these elements should depend on padding. Or am I missing something?
Is it possible that the internal LSTM states are not reset between individual elements in the same way?
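
For concreteness, point 3 corresponds roughly to the pooling sketched below. This is a simplified paraphrase in the spirit of fastai v2’s masked_concat_pool rather than the library’s exact code; the function name and tensor layout are mine, and it assumes right padding with a mask that is True at padded positions.

import torch

def masked_concat_pool_sketch(output, mask):
    # output: (bs, seq_len, n_hid) LSTM outputs; mask: (bs, seq_len) bool, True at padded positions
    lens = output.shape[1] - mask.long().sum(dim=1)           # non-padded length of each sequence
    last = output[torch.arange(output.shape[0]), lens - 1]    # last non-masked output (valid with right padding)
    avg = output.masked_fill(mask[:, :, None], 0).sum(dim=1) / lens[:, None].float()  # mean over non-padded steps
    mx = output.masked_fill(mask[:, :, None], -float('inf')).max(dim=1)[0]            # max over non-padded steps
    return torch.cat([last, mx, avg], dim=1)                  # (bs, 3 * n_hid)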

EDIT: it appears that point 3 is only fully valid for fastai v2. Looking around, I eventually came across a second variant of masked concat pool that, while it did use masking, just used the last hidden state rather than the last unmasked one. It seems this has only been improved in v2.

Is there a reason why this hasn’t been applied as a fix to v1 as well?

The internal state is reset at each forward pass (as seen here). In general the problem of matching predict to get_preds was a known issue in v1 that is solved in v2.

Thanks a lot for the reply. Would you recommend we switch to v2 already for production use of fastai? Or could we also monkey-patch masked_concat_pool and retrain our models in v1?

We are working on making fastai v2 production-ready right now, but it won’t fully land until a couple of weeks from now. I’d suggest waiting a little longer before switching.

Hello, I have a few questions:

  • in fastai v1, do predict and get_preds also give different results for segmentation?
  • is iterating through predict() always giving better results than calling get_preds(), or is it unpredictable?

(after reading a bunch of threads, I’m trying to decide whether I should call get_preds() or iterate through predict() :innocent:)

I am also getting incorrect results for get_preds; I think there is a bug here.

Validation accuracy is reported during training at 72%, and this tallies with using predict. However, get_preds reports only 40% accuracy, which suggests its output is perhaps being returned in sorted order?

Hi, this solution worked for me:
I was using get_preds with ordered=True and was getting different predictions than with predict. The solution is to FIRST sort the dataframe by text length manually and THEN use get_preds with ordered=True.
That gave me pretty accurate results.
Hope it helps.
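
Roughly, that workaround looks like the sketch below (fastai v1). The names df and learn and the 'text' column are placeholders for your own dataframe and trained text classifier, so adjust them to your setup.

import numpy as np
from fastai.basic_data import DatasetType

# FIRST: sort the dataframe by text length manually.
order = np.argsort(df['text'].str.len().values)
df_sorted = df.iloc[order].reset_index(drop=True)

# ... rebuild the DataBunch / learner from df_sorted as usual ...

# THEN: ask the text learner for predictions with ordered=True.
preds, ys = learn.get_preds(ds_type=DatasetType.Valid, ordered=True)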

What did you do to avoid it? Did you use reordered? And what is it ordering by, the length of each predicted tensor?

I’m using fastai v2 and I see inconsistent results when using get_preds. The issue is certainly not related to batching or to ordering. Eventually, I performed the test below, which demonstrates the problem.

  • I created a data frame with 4 texts where the first one had a very different prediction score between get_preds and predict.
  • Iterating over the sentences and calling predict for each one separately returns the following scores:

(‘True’, tensor(1), tensor([0.0141, 0.9859]))
(‘True’, tensor(1), tensor([0.0390, 0.9610]))
(‘True’, tensor(1), tensor([0.1072, 0.8928]))
(‘True’, tensor(1), tensor([0.0899, 0.9101]))

  • Then I moved to get_preds. I created a loop where, in the first call, only the first text was passed to get_preds (getting back one score); in the second call, the first two texts (two scores); in the third, three texts; and in the fourth, all four texts. In all calls the batch size is 64. The class-1 scores for the four iterations are listed below. In the first iteration the first text gets a score identical to predict, 0.986. In the second, its score is about 2% lower than with predict. In the third, its score drops close to zero, and in the final iteration it goes back up to the value from the second iteration.

[0.98589998]
[0.96604544 0.96099126]
[0.00979647 0.96099126 0.89280295]
[0.96604544 0.96099126 0.92748475 0.91006351]

To summarize, get_preds returns scores that are different from predict. In some cases the scores differ significantly, and they change depending on the set of texts passed in.
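
For reference, the loop looks roughly like the sketch below; learn is my trained fastai v2 text classifier and texts is the list holding the four sentences, so both names are just placeholders.

# Per-sentence predictions with predict
for t in texts:
    print(learn.predict(t))

# Incremental get_preds: first 1 text, then 2, then 3, then all 4
for n in range(1, len(texts) + 1):
    dl = learn.dls.test_dl(texts[:n], bs=64)
    preds, _ = learn.get_preds(dl=dl)
    print(preds[:, 1].numpy())   # class-1 scores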

Anyone observed this kind of problem before?

Hi everyone,

I have a similar issue when comparing the predictions of get_preds and predict.

Here is an excerpt of the DataFrame I would like to predict sentiment on:

680678 anyone needs uber im available night seems
1214792 making uber driver play porches lmaoo hate
1149696 uber doors always opening oncoming traffic
952821 get 4 uber pool one else rides entire trip
112590 try works picked within 5 minutes download
434038 stranger walked house offered pay uber bar
775014 lyft driver trump supporterlmao didnt tell
910651 man lyft driver know wassup gassin airport
688478 today uber smells like grape cold medicine
705809 thanks reminder deleted mine well signedup

When running:
pred_dl_remaining = dls.test_dl(df_remaining)
preds_remaining = learn.get_preds(dl=pred_dl_remaining, reorder=False)

I obtain the following predictions:

(TensorText([[0.5953, 0.4047],
[0.7690, 0.2310],
[0.4722, 0.5278],
[0.3296, 0.6704],
[0.7166, 0.2834],
[0.7312, 0.2688],
[0.4502, 0.5498],
[0.3216, 0.6784],
[0.3784, 0.6216],
[0.0404, 0.9596]]), None)

pred_dl_remaining.items outputs the following:

680678 anyone needs uber im available night seems
1214792 making uber driver play porches lmaoo hate
1149696 uber doors always opening oncoming traffic
952821 get 4 uber pool one else rides entire trip
112590 try works picked within 5 minutes download
434038 stranger walked house offered pay uber bar
775014 lyft driver trump supporterlmao didnt tell
910651 man lyft driver know wassup gassin airport
688478 today uber smells like grape cold medicine
705809 thanks reminder deleted mine well signedup

So as you can see, the order of the data is the same. However, when I predict the sentiment on the tweets manually by using

for tweet in df_remaining:
    print(learn.predict(tweet))

I obtain the following predictions:

(‘4’, TensorText(1), TensorText([0.2028, 0.7972]))
(‘0’, TensorText(0), TensorText([0.8492, 0.1508]))
(‘0’, TensorText(0), TensorText([0.9548, 0.0452]))
(‘4’, TensorText(1), TensorText([0.0689, 0.9311]))
(‘4’, TensorText(1), TensorText([0.1367, 0.8633]))
(‘4’, TensorText(1), TensorText([0.0721, 0.9279]))
(‘0’, TensorText(0), TensorText([0.5962, 0.4038]))
(‘0’, TensorText(0), TensorText([0.5120, 0.4880]))
(‘0’, TensorText(0), TensorText([0.9658, 0.0342]))
(‘0’, TensorText(0), TensorText([0.5621, 0.4379]))
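
In case it helps anyone reproduce this, the whole comparison boils down to something like the sketch below, where df_remaining is assumed to be the Series of tweet texts shown above.

import pandas as pd

# Batch predictions via test_dl + get_preds (reorder=False, as above)
pred_dl_remaining = dls.test_dl(df_remaining)
batch_preds, _ = learn.get_preds(dl=pred_dl_remaining, reorder=False)

# Row-by-row predictions via predict (third element is the probability tensor)
row_probs = [learn.predict(tweet)[2] for tweet in df_remaining]

comparison = pd.DataFrame({
    'text': list(df_remaining),
    'get_preds_class1': batch_preds[:, 1].numpy(),
    'predict_class1': [p[1].item() for p in row_probs],
})
print(comparison)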

Could anyone please help explain why the predictions differ?

Thanks!

Ferris