get_preds gives different results from predict

I tried to get predictions with the following code and found that the results from get_preds and predict differ. Here is my code:

df = pd.read_csv('/content/gdrive/My Drive/s2u.csv')
learn2 = load_learner('/content/gdrive/My Drive/', 's2u')
subdata = df[:10]
df2 = pd.DataFrame(list(zip(
    [learn2.predict(row)[1].tolist()[0] for row in subdata.itertuples()],
    subdata[dep_var].tolist())))
df2
          0         1
0  1.190289  1.199965
1  1.181493  1.193922
2  1.089151  1.098612
3  1.092094  1.098612
4  0.888548  0.887891
5  0.773629  0.765468
6  0.448662  0.438255
7  0.274429  0.262364
8  0.060516  0.048790
9 -0.276954 -0.287682
data_test = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names,
                                 procs=[Categorify, Normalize])
                        .split_none()
                        .label_from_df(cols=dep_var))
data_test.valid = data_test.train
data_test = data_test.databunch()
learn2.data.valid_dl = data_test.valid_dl
%time res = learn2.get_preds(ds_type=DatasetType.Valid)
odf = pd.DataFrame(list(zip(
    [item for sublist in res[0].tolist() for item in sublist],
    res[1].tolist())))
odf[:10]
          0         1
0  1.190244  1.199965
1  1.181447  1.193923
2  1.089099  1.098612
3  1.092041  1.098612
4  0.888482  0.887891
5  0.773554  0.765468
6  0.448627  0.438255
7  0.274381  0.262364
8  0.060451  0.048790
9 -0.277029 -0.287682

Why are they different, and how can I make them the same?

The tables you pasted look identical. Where are they different?

If you look closely, they vary by very small amounts.
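For concreteness, the gap can be quantified directly from the two tables pasted above (plain Python, values copied from the post):

```python
# (predict, get_preds) value pairs from the two tables above
pairs = [
    (1.190289, 1.190244), (1.181493, 1.181447),
    (1.089151, 1.089099), (1.092094, 1.092041),
    (0.888548, 0.888482), (0.773629, 0.773554),
    (0.448662, 0.448627), (0.274429, 0.274381),
    (0.060516, 0.060451), (-0.276954, -0.277029),
]
diffs = [abs(a - b) for a, b in pairs]
print(max(diffs))  # tiny (well under 1e-3), but consistently nonzero
```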

Ah, your second table doesn’t use the processors of the training set; that’s why. It will have different stats for Normalize, for instance, which is enough to explain those small differences.
Use the add_test method to create a test dataloader from your original DataBunch.


Also, if you’re going with the method you’re using (which should only be done with a labeled set), you need to pass the first DataBunch’s processor into your TabularList call too, by doing:

cat_names=cat_vars…processor=data.processor)
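Conceptually, here is what reusing the training processor achieves, sketched in plain Python (illustrative names only, not the fastai API): the encodings fitted on the training data are reapplied verbatim to the new frame instead of being refitted.

```python
# Toy stand-in for a Categorify-style processor (not fastai's implementation)
class ToyCategorify:
    def fit(self, values):
        # Learn a category -> integer-code mapping from the data seen here
        self.classes = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def apply(self, values):
        # Reuse the stored mapping on new data
        return [self.classes[v] for v in values]

train_proc = ToyCategorify().fit(["low", "med", "high"])
# Refitting on the test column alone learns a *different* mapping:
test_proc = ToyCategorify().fit(["med", "high"])

print(train_proc.apply(["med", "high"]))  # codes the model was trained on
print(test_proc.apply(["med", "high"]))   # different codes -> wrong inputs
```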


I was following your code in https://github.com/muellerzr/fastai-Experiments-and-tips/blob/master/Test%20Set%20Generation/Labeled_Test_Set.ipynb
So how do I pass the processor (from where to where)? Can you show me some sample code?

So I assume the numbers from predict are more accurate because it uses the normalizer from the training data? And how do I use add_test?

It’s in the post above, i.e.:

TabularList(.......processor=data.processor)

where data is your original databunch

There is also an example of add_test in that notebook

Also, the TL;DR is wrong. If you had followed along to the “Train/Valid/Test Split - The proper way” section, you would have caught this difference. (This needs to be changed, but I’m unsure when I’ll get to it, as I’m focused on fastai2 now and don’t intend on going back.)


Cool. I changed the code to the following and now the numbers match those from predict. Thanks!

data_test = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names,
                                 procs=[Categorify, Normalize],
                                 processor=learn2.data.processor)
                        .split_none()
                        .label_from_df(cols=dep_var))

Related topic: Do we need to replace data.processor when loading learner?

I’m migrating my code to v2 and wonder if I need to do the same. But I can’t find the counterpart of the processor argument in v2’s from_df:

TabularDataLoaders.from_df [source]

TabularDataLoaders.from_df(df, path='.', procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, valid_idx=None, bs=64, shuffle_train=None, shuffle=True, val_shuffle=False, n=None, device=None, drop_last=None, val_bs=None)

What is the right way to do so?