TabularDataBunch.from_df doesn't add / acknowledge test set

See the full Kaggle kernel above. Not sure about the 2nd part (if you mean categories in the test set that aren’t in the train set). In the Titanic case this is not an issue, but I would guess .add_test takes care of that.
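A quick pandas sketch (with made-up, Titanic-style values; the Embarked column is illustrative only) of how one might check up front for categories that appear in the test set but never in the training set:

```python
import pandas as pd

# Made-up frames; 'Embarked' is an illustrative Titanic-style column.
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "C", "X"]})  # "X" never seen in train

# Categories in the test set that the training set never saw.
unseen = set(test["Embarked"]) - set(train["Embarked"])
print(unseen)  # {'X'}
```

If I understand Categorify correctly, unseen test categories end up treated like missing values rather than raising an error, but checking up front makes surprises easier to debug.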


To clarify, by adding the test set using add_test() as per the code below, is the test set normalized according to the same parameters as the train/valid set? Or should it be added after data has been initialised?

Also, on a possibly related note, the following code throws a warning when the highlighted part is added, and I’m not sure how to locate the problematic part.

When you use add_test, the transforms (FillMissing, Normalize and Categorify) always use the state determined on the training set.
You should pass cat_names and cont_names too when you create your TabularList for the test set, this may be why you have the warning.
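A minimal numpy sketch of that behaviour, with made-up values: the normalization statistics are computed on the training column only and then reused unchanged on the test column:

```python
import numpy as np

# Made-up training and test columns for illustration.
train_col = np.array([10.0, 20.0, 30.0, 40.0])
test_col = np.array([15.0, 25.0])

# The Normalize state is determined on the training set only...
mean, std = train_col.mean(), train_col.std()

# ...and the same parameters are then applied to the test set.
test_normalized = (test_col - mean) / std
```

With add_test this reuse is automatic; the sketch just shows why statistics are never recomputed on the test set.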


How is the output called and used? From the above approach I believe I have trained columnar data, but am unsure how to verify or apply the results. I have:

test_df = dftest
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_vars, procs=procs)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                   .add_test(TabularList.from_df(test_df, path=path, cat_names=cat_names, cont_names=cont_names))
                   .databunch())

And then I train the learner and try to get preds with:

preds = learn.get_preds(ds_type='Test')

Which gives me:
[tensor([[0.5986],
[0.8231],
[0.7974],
…,

I am confused about the output. Is this the log of the dependent variable? Have I made a mistake elsewhere?


Yes, it’s the log of the dependent variable (since you put log=True), as predicted by your model.
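Since log=True logged the labels, a small numpy sketch (reusing the first few predicted values from the post above) of getting back to the original scale with exp:

```python
import numpy as np

# The first few predicted values from the post above, on the log scale.
log_preds = np.array([0.5986, 0.8231, 0.7974])

# Invert the log=True transform to recover the dependent variable's scale.
preds = np.exp(log_preds)
```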


Of course! Thank you!

I suppose that the correct way is something like:

data = (TabularList.from_df(train_df, path=path, cat_names=cat_vars,
                            cont_names=cont_vars+bools_vars, procs=procs)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                   .add_test(TabularList.from_df(test_df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs))
                   .databunch())

But in doing so I get the following exception:

Exception: There are nan values in field 'field_name' but there were none in the training set.
Please fix those manually.

How to fix?
Thanks.


As indicated by the error message, you have NaNs in your validation or test set in a field where the training set had none (the error message needs fixing so that it shows the field name). You need to fix them manually, since the processor state is fixed by the training set: either introduce some NaNs in that column of your training set, or remove the ones in the validation/test set.
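One manual fix, sketched with pandas on made-up data (field_name mirrors the name in the error message): fill the test-set NaNs in that column yourself, for instance with the training-set median, which is what FillMissing would have learned had the training set contained NaNs there:

```python
import pandas as pd

# Made-up frames; 'field_name' mirrors the column named in the error.
train = pd.DataFrame({"field_name": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"field_name": [2.0, None, 5.0]})

# Fill the test-set NaNs with the training-set median before building
# the TabularList, since the FillMissing state was fixed on train.
test["field_name"] = test["field_name"].fillna(train["field_name"].median())
```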

How do I add labels to the test dataset?
If I specify:

data = (TabularList.from_df(train_df, path=path, cat_names=cat_vars,
                            cont_names=cont_vars, procs=procs)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                   .add_test(TabularList.from_df(test_df, path=path, cat_names=cat_vars,
                                                 cont_names=cont_vars, procs=procs)
                            .label_from_df(cols=dep_var, label_cls=FloatList, log=True))
                   .databunch())

I get an error using

data.show_batch(4, ds_type=DatasetType.Test)

AttributeError: 'TabularList' object has no attribute 'codes'

What’s wrong?
Thanks again.

In fastai the test set is not labelled; it’s there to quickly get predictions on a lot of inputs. If you want to validate on a second test set, you should create a second data object with that test set as the validation set, as documented here.


I’m receiving an error here. Is it deprecated? What’s the new way?

data = (TabularList.from_df(df, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                   .split_by_idx(valid_idx)
                   .add_test(TabularList.from_df(df_test, cat_names=cat_vars, cont_names=cont_vars, procs=procs))
                   .label_from_df(cols=dep_var, label_cls=CategoryList)
                   .databunch())


AttributeError: 'TabularList' object has no attribute 'add_test'

Hello,

Has anyone been able to get predictions on the test set to work on the latest version? (1.0.42)

With:

data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=test_index, test_df=test_df, 
                            procs=procs, cat_names=cat_names, 
                            cont_names=cont_names)

If I try learn.get_preds(data.test_ds), I get predictions in the shape of the training set.

With learn.get_preds(), they’re in the shape of the validation set.

And learn.predict(data.test_ds) gives me a key error corresponding to one of the columns.

Any help would be much appreciated!

It’s learn.get_preds(ds_type=DatasetType.Test) (as can be seen in the docs).


add_test should go after label_from_df. That will resolve your error:
AttributeError: 'TabularList' object has no attribute 'add_test'

Hi :slight_smile:
I’m working on a time-series forecasting problem using a fastai tabular dataset, as instructed in the Rossmann challenge.
When using the same data and model parameters across two training runs, I get a completely different result for a given point.

Thinking this may be due to the randomized batch selection during training, my idea is to first fix the batch selection across training runs.

The following code gives different results when run twice; how can I fix this?
Is there a parameter for it? I saw fix_dl=None but don’t know how to use it properly.

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, log=True)
                   .add_test(TabularList.from_df(test, path=path, cat_names=cat_vars, cont_names=cont_vars))
                   .databunch())
data.show_batch(10)

Thank you

When generating the training loader, it’s shuffled and the last batch (if not complete) is dropped. If you try showing a batch from the validation set it should always be the same.
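A plain-Python sketch of that loader behaviour, with a hypothetical batches helper: shuffling plus dropping the last incomplete batch, and how fixing the random seed makes the batch order reproducible across runs:

```python
import random

# Hypothetical helper mimicking the training loader: shuffle the row
# indices, then drop the last incomplete batch.
def batches(indices, batch_size, seed):
    rng = random.Random(seed)  # fixing the seed fixes the shuffle
    idx = list(indices)
    rng.shuffle(idx)
    n_full = len(idx) // batch_size
    return [idx[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]

run1 = batches(range(10), batch_size=3, seed=42)
run2 = batches(range(10), batch_size=3, seed=42)
# Same seed -> identical batch order; 10 rows give only 3 full batches,
# so one row is dropped each epoch.
```

This is only a sketch of the mechanism, not fastai’s actual loader; in practice one would seed the relevant RNGs (random, numpy, torch) before training to make runs comparable.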


Thank you for your reply !

So sgugger, couldn’t the FillMissing proc fill the NaNs in the data?

If the optional parameter test_df=df_test is given in TextLMDataBunch and TextClasDataBunch, how can we load a pretrained model and have the saved learner’s get_preds() use the df_test that was already given?
I get a NoneType error when I use get_preds(ds_type=Dataset.Type.test).

However, when I add the df_test via load_learner or add_test, it works.
What is the point of having the option of specifying test_df in the databunch if it cannot be used for prediction unless it is added again via load_learner or add_test?

It’s ds_type=DatasetType.test (Without the first dot)