Get_preds gives different result from predict

jerron · March 1, 2020, 6:16am

I tried to get predictions with the following code, and I found that the results differ between get_preds and predict. Here is my code:

df = pd.read_csv('/content/gdrive/My Drive/s2u.csv')
learn2 = load_learner('/content/gdrive/My Drive/','s2u')
subdata=df[:10]
df2=pd.DataFrame(list(zip( [learn2.predict(row)[1].tolist()[0] for row in subdata.itertuples()],subdata[dep_var].tolist())))
df2

	0	1
0	1.190289	1.199965
1	1.181493	1.193922
2	1.089151	1.098612
3	1.092094	1.098612
4	0.888548	0.887891
5	0.773629	0.765468
6	0.448662	0.438255
7	0.274429	0.262364
8	0.060516	0.048790
9	-0.276954	-0.287682

data_test = (TabularList.from_df(df,cat_names=cat_names, cont_names=cont_names, procs=[Categorify, Normalize])
                           .split_none()
                           .label_from_df(cols=dep_var))
data_test.valid = data_test.train
data_test=data_test.databunch()
learn2.data.valid_dl = data_test.valid_dl
%time res=learn2.get_preds(ds_type=DatasetType.Valid)
odf=pd.DataFrame(list(zip([item for sublist in res[0].tolist() for item in sublist ],res[1].tolist())))
odf[:10]

	0	1
0	1.190244	1.199965
1	1.181447	1.193923
2	1.089099	1.098612
3	1.092041	1.098612
4	0.888482	0.887891
5	0.773554	0.765468
6	0.448627	0.438255
7	0.274381	0.262364
8	0.060451	0.048790
9	-0.277029	-0.287682

Why are they different, and how can I make them the same?

sgugger · March 1, 2020, 3:03pm

The tables you pasted seem identical. Where is it different?

mschmit5 · March 1, 2020, 3:06pm

If you look closely, they vary by very small amounts.

sgugger · March 1, 2020, 3:15pm

Ah, your second table doesn’t use the processors of the training set, that’s why. It must have different stats for Normalize for instance, which is enough to explain why you have those differences.
Use the add_test method to create a test dataloader from your original DataBunch.
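To illustrate the point about Normalize, here is a minimal sketch in plain Python (not fastai code). It assumes a Normalize-style step stores a mean and standard deviation and applies (x - mean) / std; the helper names fit_stats and normalize and the sample numbers are hypothetical. Statistics recomputed on a different frame give the model slightly different inputs for the same rows, which is enough to shift the predictions.

```python
from statistics import mean, pstdev

def fit_stats(values):
    """Return the (mean, std) a Normalize-style step would store."""
    return mean(values), pstdev(values)

def normalize(values, stats, eps=1e-7):
    """Apply (x - mean) / (std + eps) using previously stored stats."""
    mu, sigma = stats
    return [(v - mu) / (sigma + eps) for v in values]

# Hypothetical column values: the data seen at training time,
# and a later frame that is processed from scratch.
train_col = [1.0, 2.0, 3.0, 4.0, 5.0]
full_col = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]

train_stats = fit_stats(train_col)  # what the trained learner holds
fresh_stats = fit_stats(full_col)   # what a brand-new processor computes

rows = full_col[:3]
with_train_stats = normalize(rows, train_stats)
with_fresh_stats = normalize(rows, fresh_stats)

# The same raw rows become different model inputs.
inputs_differ = any(
    abs(a - b) > 1e-6 for a, b in zip(with_train_stats, with_fresh_stats)
)
print(with_train_stats)
print(with_fresh_stats)
print(inputs_differ)
```

Reusing train_stats for new rows is the plain-Python analogue of reusing the processors from the original DataBunch.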

muellerzr · March 1, 2020, 3:52pm

Also, if you stick with the method you’re using (which should only be done with a labeled set), you need to pass in the first DataBunch’s processor too when you call TabularList, like so:

cat_names=cat_vars…processor=data.processor)

jerron · March 1, 2020, 11:21pm

I was following your code in fastai-Experiments-and-tips/Test Set Generation/Labeled_Test_Set.ipynb at master · muellerzr/fastai-Experiments-and-tips · GitHub
So how do I pass the processor (which one, and to what)? Could you show me some sample code?

jerron · March 1, 2020, 11:23pm

So I assume the numbers from predict are more accurate because it uses the normalizer from the training data? How do I use add_test?

muellerzr · March 1, 2020, 11:24pm

It’s in the above post… IE:

TabularList(.......processor=data.processor)

where data is your original databunch

There is also an example of add_test in that notebook

Also, the TL;DR is wrong. If you had followed along to the “Train/Valid/Test Split - The proper way” section, you would have caught this difference. (This needs to be changed, but I’m unsure when I’ll get to it, as I’m focused on fastai2 now and don’t intend on going back.)

jerron · March 1, 2020, 11:32pm

Cool. I changed the code to the following and now get numbers matching those from predict. Thanks!

data_test = (TabularList.from_df(df,cat_names=cat_names, cont_names=cont_names, procs=[Categorify, Normalize],processor=learn2.data.processor)
                           .split_none()
                           .label_from_df(cols=dep_var))

jerron · June 23, 2021, 5:51am

Related topic: Do we need to replace data.processor when loading learner?

I’m migrating my code to V2 and wonder if I need to do the same. But I can’t find the V2 counterpart of the processor argument from the previous version of from_df:

`TabularDataLoaders.from_df` [source]

TabularDataLoaders.from_df(df, path='.', procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, valid_idx=None, bs=64, shuffle_train=None, shuffle=True, val_shuffle=False, n=None, device=None, drop_last=None, val_bs=None)

What is the right way to do so?
