TabularDataBunch.from_df doesn't add / acknowledge test set

cadolphs · November 20, 2018, 9:49pm

Using the Titanic Kaggle datasets (train.csv and test.csv), I am trying to do very simple tabular data NN work.
I am following the example from the fast.ai docs pretty much exactly, but when creating the TabularDataBunch I am using this line:

data = TabularDataBunch.from_df(path, df_out, dep_var, test_ds=test_df, valid_idx=range(100), procs=procs, cat_names=cat_vars)

However, I note that data.test_ds is None. This makes sense, because in the class definition of TabularDataBunch, we have

@classmethod
def from_df(cls, path, df:DataFrame, dep_var:str, valid_idx:Collection[int], procs:OptTabTfms=None,
                cat_names:OptStrList=None, cont_names:OptStrList=None, classes:Collection=None, **kwargs)->DataBunch:
    "Create a `DataBunch` from train/valid/test dataframes."
    cat_names = ifnone(cat_names, [])
    cont_names = ifnone(cont_names, list(set(df)-set(cat_names)-{dep_var}))
    procs = listify(procs)
    return (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)                     
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, classes=None)
        .databunch())

And, well, test_ds doesn’t show up in any of the arguments, so it gets passed in kwargs, but I note that kwargs doesn’t get used anywhere in the body.

The result of this is that, further down the road, I can’t do

learn.get_preds(DatasetType.Test) because I’ll get an error about NoneType not supporting the action.
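
The silent loss of test_ds described above is a general property of functions that accept **kwargs without reading them. Here is a minimal pure-Python sketch of that behaviour. The function names and the dictionary it returns are hypothetical stand-ins for illustration, not fastai code.

```python
def from_df_sketch(df, dep_var, valid_idx, procs=None, **kwargs):
    """Hypothetical factory that accepts **kwargs but never reads them."""
    # Only the named parameters are used; anything in kwargs is silently dropped.
    return {"train": df, "dep_var": dep_var, "valid_idx": list(valid_idx), "test": None}


def strict_from_df_sketch(df, dep_var, valid_idx, procs=None, **kwargs):
    """Same sketch, but it rejects keywords it does not understand."""
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")
    return from_df_sketch(df, dep_var, valid_idx, procs=procs)


rows = [{"Survived": 1}, {"Survived": 0}]

# test_ds is accepted without complaint, but it never reaches the result.
bunch = from_df_sketch(rows, "Survived", range(1), test_ds=[{"Survived": None}])
print(bunch["test"])
```

The strict variant shows one possible design choice: failing loudly on unknown keywords makes this kind of mistake visible at call time rather than later, when predictions on the test set are requested.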

sgugger · November 20, 2018, 10:26pm

test_ds isn’t supported in this function, you should use the data block API to add it by adding:

add_test(TabularList.from_df(test_df, path=path, cat_names=cat_names, cont_names=cont_names))

between the label and the databunch line.

cadolphs · November 20, 2018, 10:35pm

That sounds like the library itself needs that modification, right? As in, for someone just using the library without changing its source code, there is no way to get a test set added to the TabularDataBunch?

sgugger · November 20, 2018, 10:47pm

It mostly sounds like you need to learn the data block API.
The basic factory methods are only there to get started quickly. To make any modification to them, you should use the data block API: just copy-paste the source code and modify the line you need.

cadolphs · November 21, 2018, 12:15am

Fair enough; I just would have expected, from a design point of view, that loading your train and test data in one go would be a basic use case necessary to get started quickly, much like the earlier version of fast.ai used to do it.

The data block api seems very nice indeed, much like that grammar-of-graphics plotting approach.

oxyd33 · November 21, 2018, 5:20pm

Hi

I just tried what ‘sgugger’ said about ‘add_test’ but it throws me an error.

Added in the fastai library:

My Code:

Error Message:

Can someone tell me what I did wrong!?

cadolphs · November 21, 2018, 6:44pm

I get it to work like so:

il = (TabularList.from_csv(path,'train.csv',header='infer',cols=cols, cont_names=cont_names, cat_names=cat_names, procs=procs)
      .random_split_by_pct()
      .label_from_df(cols='Survived')
      .add_test(TabularList.from_csv(path, 'test.csv')))

Note how add_test does not have to specify any of the procs, cat names, cont vars etc again, because they are inferred from the train set. So try your code again but with add_test stripped to the bare minimum of providing the path or data frame.

oxyd33 · November 26, 2018, 2:15pm

Hi,

thanks for your reply and help!
Well, I tried passing the bare minimum (a data frame), but
it still does not work. Same error.

Here it breaks --> self.dl seems to be null/empty

AbuFadl · November 27, 2018, 3:07pm

I am working on Titanic and got it to work like so:

        train_df = shuffle(train_df)
        dep_var = 'Survived'
        cat_names = ['Pclass', 'Sex', 'Alone', 'SibCh', 'Embarked'] # 'AgeGroup', 'FareGroup',
        cont_names = ['Age', 'Fare', 'Relatives' ] 
        procs = [FillMissing, Categorify, Normalize]
        test = TabularList.from_df(test_df, cat_names=cat_names, cont_names=cont_names, procs=procs)
        data = (TabularList.from_df(train_df, path='.', cat_names=cat_names, cont_names=cont_names, procs=procs)
                                    .split_by_idx(valid_idx=range(len(train_df)-175,len(train_df)))
                                   .label_from_df(cols=dep_var)
                                   .add_test(test, label=0)
                                   .databunch())
        emb_szs={'Pclass':6}
        learn = tabular_learner(data, layers=[60,40], emb_szs= emb_szs,  metrics=accuracy) 
        learn.lr_find()
        learn.recorder.plot()
        lr = 1e-1
        learn.fit_one_cycle(4, lr)

Code copied from colab (Kaggle only works with cpu and unstable at the moment - for me at least). Some features are made up.

PS: learn.show_results() doesn’t seem to work # https://docs.fast.ai/tutorial.data.html
I am trying to find out how to configure embedding (best practices) for categories. I saw examples ignoring emb_szs and others using it. What approximate criteria to use?
Update: found emb_szs hint at https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py and posted Kaggle kernel: https://www.kaggle.com/abedkhooli/fastai-titanic (using cpu, torch install takes too much time).

nok · November 28, 2018, 9:02am

Can you share how you called prediction on the test set? Are unseen tokens in the test set handled by the embedding?

AbuFadl · November 28, 2018, 9:46am

See full Kaggle kernel above. Not sure about 2nd part (if you mean categories in test set not in train set). In the Titanic case, this is not an issue but I would guess .add_test takes care of that.
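
On the question of categories that appear in the test set but not in the training set: one common approach is to build the category vocabulary from the training data only and reserve a single index for anything unseen. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not a claim about exactly what add_test or Categorify does internally, and the helper names are made up for the example.

```python
def build_vocab(train_values):
    """Map each category seen in training to an index starting at 1; 0 is reserved for unknowns."""
    return {cat: i for i, cat in enumerate(sorted(set(train_values)), start=1)}


def encode(values, vocab):
    """Encode values with the training vocabulary; unseen categories fall back to index 0."""
    return [vocab.get(v, 0) for v in values]


train_embarked = ["S", "C", "S", "Q"]
test_embarked = ["S", "X", "Q"]  # "X" never appears in training

vocab = build_vocab(train_embarked)
codes = encode(test_embarked, vocab)
print(vocab)  # {'C': 1, 'Q': 2, 'S': 3}
print(codes)  # [3, 0, 2]
```

With a scheme like this, an embedding layer needs one extra row for index 0, and every unseen category shares that row.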

AdrianT · December 24, 2018, 5:35pm

To clarify, by adding the test set using add_test() as per the code below, is the test set normalized according to the same parameters as the train/valid set? Or should it be added after data has been initialised?

Also, on a possibly related note, following code throws up a warning when the highlighted part is added and I’m not sure how to locate the problematic part.

sgugger · December 26, 2018, 10:38am

When you use add_test, the transforms (FillMissing, Normalize and Categorify) always use the state determined on the training set.
You should pass cat_names and cont_names too when you create your TabularList for the test set, this may be why you have the warning.
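
To make "the state determined on the training set" concrete, here is a small sketch of normalization where the mean and standard deviation are computed once from training values and then reused unchanged for test values. It uses only the standard library; the function names and the sample fares are illustrative assumptions, not fastai internals.

```python
import statistics


def fit_normalizer(train_values):
    """Compute the normalization state (mean, std) from training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def apply_normalizer(values, state):
    """Normalize any split with the state that was fitted on the training data."""
    mean, std = state
    return [(v - mean) / std for v in values]


train_fare = [7.25, 71.28, 7.92, 53.10]
test_fare = [7.83, 9.69]

state = fit_normalizer(train_fare)
train_norm = apply_normalizer(train_fare, state)
test_norm = apply_normalizer(test_fare, state)  # no statistics are recomputed here

print(statistics.fmean(train_norm))  # close to 0 by construction
print(statistics.fmean(test_norm))   # generally not 0, since test data did not set the state
```

Reusing the training state keeps test inputs on the same scale the model saw during training, which is why the test set should not be normalized on its own.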

matthewarthur · January 6, 2019, 1:55pm

How is the output called and used? From the above approach I believe I have trained columnar data, but am unsure how to verify or apply the results. I have:

test_df = dftest
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_vars, procs=procs)
.split_by_idx(valid_idx)
.label_from_df(cols=dep_var, label_cls=FloatList, log=True)
.add_test(TabularList.from_df(test_df, path=path, cat_names=cat_names, cont_names=cont_names))
.databunch())

And then I train the learner and try to get preds with:

preds = learn.get_preds(ds_type='Test')

Which gives me:
[tensor([[0.5986],
[0.8231],
[0.7974],
…,

I am confused about the output. Is this the log of the dependent variable? Have I made a mistake elsewhere?

sgugger · January 6, 2019, 9:40pm

Yes, it’s the log of the dependent variable (since you put log=True), as predicted by your model.
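
If the predictions are in log space, the values on the original scale can be recovered by exponentiating. A minimal sketch, assuming log=True applies the natural logarithm and using the first three values from the output above as plain floats:

```python
import math

# First three predicted values from the output above, in log space.
log_preds = [0.5986, 0.8231, 0.7974]

# Undo the log transform to get predictions on the original scale.
preds_original_scale = [math.exp(p) for p in log_preds]
print(preds_original_scale)
```

With the tensor returned by get_preds, the equivalent would be to call torch.exp on the predictions tensor.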

matthewarthur · January 6, 2019, 9:58pm

Of course! Thank you!

EanX · January 9, 2019, 3:35pm

I suppose that the correct way is something like:

data = (TabularList.from_df(train_df, path=path, cat_names=cat_vars,
cont_names=cont_vars+bools_vars, procs=procs)
.split_by_idx(valid_idx)
.label_from_df(cols=dep_var, label_cls=FloatList, log=True)
.add_test(TabularList.from_df(test_df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs))
.databunch())

But in doing so I get the following exception:

Exception: There are nan values in field 'field_name' but there were none in the training set. 
                Please fix those manually.

How can I fix this?
Thanks.

sgugger · January 9, 2019, 3:37pm

As indicated by the error message, you have NaNs in your validation or test set in a field where the training set had none (we need to fix the error message so that it shows the field name). You need to fix them manually since the processor state is fixed by the training set, so either introduce some NaNs in your training set in this column, or remove the ones in the validation/test set.
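
Since the error message does not name the field, a quick way to locate it is to compare which columns contain missing values in each split. The sketch below does this on plain lists of dictionaries with the standard library, and then applies one possible fix: filling the offending test values with the training median. Both the helper names and the choice of median are assumptions made for illustration. With pandas DataFrames, the per-column check could be done with df.isna().any().

```python
import math
import statistics


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def has_missing(rows, col):
    return any(is_missing(r[col]) for r in rows)


def columns_missing_only_outside_train(train_rows, other_rows, cols):
    """Columns with missing values in other_rows but none in train_rows."""
    return [c for c in cols if has_missing(other_rows, c) and not has_missing(train_rows, c)]


def fill_with_train_median(train_rows, other_rows, col):
    """Replace missing values in other_rows[col] with the median of the training column."""
    median = statistics.median(r[col] for r in train_rows if not is_missing(r[col]))
    return [{**r, col: median} if is_missing(r[col]) else dict(r) for r in other_rows]


train_rows = [{"Age": 22.0, "Fare": 7.25}, {"Age": float("nan"), "Fare": 71.28}]
test_rows = [{"Age": 30.0, "Fare": float("nan")}, {"Age": float("nan"), "Fare": 8.05}]

problem_cols = columns_missing_only_outside_train(train_rows, test_rows, ["Age", "Fare"])
print(problem_cols)  # ['Fare']: Age is missing in both splits, so it is not flagged

fixed_test = fill_with_train_median(train_rows, test_rows, "Fare")
print(columns_missing_only_outside_train(train_rows, fixed_test, ["Age", "Fare"]))  # []
```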

EanX · January 10, 2019, 1:20pm

How do I add labels to the test dataset?
If I specify:

data = (TabularList.from_df(train_df, path=path, cat_names=cat_vars,
cont_names=cont_vars, procs=procs)
.split_by_idx(valid_idx)
.label_from_df(cols=dep_var, label_cls=FloatList, log=True)
.add_test(TabularList.from_df(test_df, path=path, cat_names=cat_vars,
cont_names=cont_vars, procs=procs)
.label_from_df(cols=dep_var, label_cls=FloatList, log=True))
.databunch())

I get an error using

data.show_batch(4, ds_type=DatasetType.Test)

AttributeError: 'TabularList' object has no attribute 'codes'

What’s wrong?
Thanks again.

sgugger · January 10, 2019, 2:24pm

In fastai the test set is not labelled. It’s meant for quickly getting predictions on a lot of inputs. If you want to validate on a second labelled set, you should create a second data object with that set as its validation set, as documented here.