jerron
(jerron)
March 1, 2020, 6:16am
1
I tried to get predictions with the following code and found that the results differ between get_preds and predict. Here is my code:
df = pd.read_csv('/content/gdrive/My Drive/s2u.csv')
learn2 = load_learner('/content/gdrive/My Drive/','s2u')
subdata=df[:10]
df2=pd.DataFrame(list(zip( [learn2.predict(row)[1].tolist()[0] for row in subdata.itertuples()],subdata[dep_var].tolist())))
df2
          0         1
0  1.190289  1.199965
1  1.181493  1.193922
2  1.089151  1.098612
3  1.092094  1.098612
4  0.888548  0.887891
5  0.773629  0.765468
6  0.448662  0.438255
7  0.274429  0.262364
8  0.060516  0.048790
9 -0.276954 -0.287682
data_test = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=[Categorify, Normalize])
             .split_none()
             .label_from_df(cols=dep_var))
data_test.valid = data_test.train
data_test=data_test.databunch()
learn2.data.valid_dl = data_test.valid_dl
%time res=learn2.get_preds(ds_type=DatasetType.Valid)
odf=pd.DataFrame(list(zip([item for sublist in res[0].tolist() for item in sublist ],res[1].tolist())))
odf[:10]
          0         1
0  1.190244  1.199965
1  1.181447  1.193923
2  1.089099  1.098612
3  1.092041  1.098612
4  0.888482  0.887891
5  0.773554  0.765468
6  0.448627  0.438255
7  0.274381  0.262364
8  0.060451  0.048790
9 -0.277029 -0.287682
Why are they different, and how can I make them the same?
The tables you pasted seem identical. Where is it different?
If you look closely they vary by very small amounts
Ah, your second table doesn’t use the processors of the training set, that’s why. It must have different stats for Normalize for instance, which is enough to explain why you have those differences.
Use the add_test method to create a test dataloader from your original DataBunch.
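To illustrate the point about Normalize statistics, here is a minimal sketch in plain Python. It is not fastai's implementation, and the column values are made up. It standardizes the same rows twice: once with the mean and standard deviation of the training column, and once with statistics recomputed from the rows being scored. The model would see different inputs in the two cases, which is enough to shift its predictions.

```python
from statistics import mean, pstdev

def normalize(values, mu, sigma):
    """Standardize values with the given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

# Hypothetical continuous column: values seen at training time
train_col = [2.0, 4.0, 6.0, 8.0, 10.0]
# Hypothetical rows we want predictions for
new_col = [2.0, 4.0, 6.0]

# Statistics stored at training time (what reusing the training processor gives you)
train_mu, train_sigma = mean(train_col), pstdev(train_col)
# Statistics recomputed on the new data (what a fresh Normalize would do)
new_mu, new_sigma = mean(new_col), pstdev(new_col)

with_train_stats = normalize(new_col, train_mu, train_sigma)
with_own_stats = normalize(new_col, new_mu, new_sigma)

print(with_train_stats)
print(with_own_stats)
```

The closer the new data's statistics are to the training statistics, the smaller the gap, which is consistent with the tiny differences in the two tables above.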
muellerzr
(Zachary Mueller)
March 1, 2020, 3:52pm
5
Also, if you stick with your current method (which should only be used for a labeled set), you need to pass the first DataBunch’s processor in your call to TabularList as well:
cat_names=cat_vars…processor=data.processor)
jerron
(jerron)
March 1, 2020, 11:21pm
6
muellerzr:
Also, if you stick with your current method (which should only be used for a labeled set), you need to pass the first DataBunch’s processor in your call to TabularList as well:
cat_names=cat_vars…processor=data.processor)
I was following your code in fastai-Experiments-and-tips/Test Set Generation/Labeled_Test_Set.ipynb at master · muellerzr/fastai-Experiments-and-tips · GitHub
So how do I pass the processor (of what to whom)? Can you show me the sample code?
jerron
(jerron)
March 1, 2020, 11:23pm
7
sgugger:
Ah, your second table doesn’t use the processors of the training set, that’s why. It must have different stats for Normalize for instance, which is enough to explain why you have those differences.
Use the add_test method to create a test dataloader from your original DataBunch.
So I assume the numbers from predict are more accurate because it uses the normalizer from the training data? How do I use add_test?
muellerzr
(Zachary Mueller)
March 1, 2020, 11:24pm
8
It’s in the above post, i.e.:
TabularList(.......processor=data.processor)
where data is your original DataBunch.
There is also an example of add_test in that notebook.
Also, the TL;DR is wrong. If you had followed the “Train/Valid/Test Split - The proper way” section, you would have caught this difference. (This needs to be changed, but I’m unsure when I’ll get to it, as I’m focused on fastai2 now and don’t intend to go back.)
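For reference, a minimal sketch of the add_test route in fastai v1. It assumes `learn` is a tabular Learner whose `learn.data` still carries the training processor, and that `test_df` has the same categorical and continuous columns as the training frame. The function name `predict_on_test_set` is hypothetical, and the fastai imports sit inside the function so that merely defining it does not require fastai to be installed.

```python
def predict_on_test_set(learn, test_df, cat_names, cont_names):
    """Sketch: score test_df with the preprocessing state learned at training time.

    Assumes fastai v1; `learn`, `test_df`, `cat_names` and `cont_names`
    are supplied by the caller (hypothetical names for illustration).
    """
    from fastai.basic_data import DatasetType
    from fastai.tabular.data import TabularList

    # Reuse the training processor so Categorify/Normalize apply the
    # categories and statistics computed on the training set.
    test_items = TabularList.from_df(
        test_df,
        cat_names=cat_names,
        cont_names=cont_names,
        processor=learn.data.processor,
    )
    # Attach the items as the (unlabeled) test set of the existing DataBunch
    learn.data.add_test(test_items)
    # The test set has no real labels, so the second return value is not meaningful
    preds, _ = learn.get_preds(ds_type=DatasetType.Test)
    return preds
```

Because a fastai v1 test set is unlabeled, this suits pure inference. Comparing against known targets, as in the original post, still needs the labeled-set approach with the processor passed in.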
jerron
(jerron)
March 1, 2020, 11:32pm
9
muellerzr:
processor=data.processor
Cool. I changed the code to the following and now get numbers matching those from predict. Thanks!
data_test = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=[Categorify, Normalize], processor=learn2.data.processor)
             .split_none()
             .label_from_df(cols=dep_var))
jerron
(jerron)
June 23, 2021, 5:51am
10
Related topic: Do we need to replace data.processor when loading learner?
I’m migrating my code to v2 and wonder whether I need to do the same, but I can’t find a v2 counterpart to the processor argument that from_df accepted in the previous version:
TabularDataLoaders.from_df(df, path='.', procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, valid_idx=None, bs=64, shuffle_train=None, shuffle=True, val_shuffle=False, n=None, device=None, drop_last=None, val_bs=None)
What is the right way to do so?