TabularDataBunch.from_df doesn't add / acknowledge test set


(Clemens Adolphs) #1

Using the Titanic Kaggle datasets (train.csv and test.csv), I am trying to do very simple tabular data NN work.
I am following the example from the fast.ai docs pretty much exactly, but when creating the TabularDataBunch I am using this line:

data = TabularDataBunch.from_df(path, df_out, dep_var, test_ds=test_df, valid_idx=range(100), procs=procs, cat_names=cat_vars)

However, I note that data.test_ds is None. This makes sense, because in the class definition of TabularDataBunch, we have

@classmethod
def from_df(cls, path, df:DataFrame, dep_var:str, valid_idx:Collection[int], procs:OptTabTfms=None,
                cat_names:OptStrList=None, cont_names:OptStrList=None, classes:Collection=None, **kwargs)->DataBunch:
    "Create a `DataBunch` from train/valid/test dataframes."
    cat_names = ifnone(cat_names, [])
    cont_names = ifnone(cont_names, list(set(df)-set(cat_names)-{dep_var}))
    procs = listify(procs)
    return (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)                     
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, classes=None)
        .databunch())

And, well, test_ds doesn’t show up in any of the arguments, so it would get passed in kwargs, but I note that kwargs doesn’t get used anywhere in the body.

The result of this is that, further down the road, I can’t do

learn.get_preds(DatasetType.Test) because I’ll get an error about NoneType not supporting the action.
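
The silent drop described in this post is ordinary Python behaviour: a function that accepts `**kwargs` but never reads them will discard any unexpected keyword without complaint. A minimal plain-Python sketch, where `from_df_like` and `strict_from_df_like` are hypothetical stand-ins and not fastai code:

```python
def from_df_like(df, dep_var, valid_idx, **kwargs):
    """Hypothetical factory that accepts **kwargs but never uses them."""
    # Only the named parameters end up in the result; kwargs is ignored.
    return {"train": df, "dep_var": dep_var, "valid_idx": list(valid_idx), "test": None}


def strict_from_df_like(df, dep_var, valid_idx, **kwargs):
    """Same stand-in, but it rejects keywords it does not understand."""
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")
    return from_df_like(df, dep_var, valid_idx)


# test_ds is accepted syntactically, then dropped without any error.
bunch = from_df_like([1, 2, 3], "Survived", range(2), test_ds=[4, 5])
print(bunch["test"])
```

The strict variant shows one way such a mistake could be surfaced early instead of appearing later as a `None` test set.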


#2

test_ds isn’t supported in this function. You should use the data block API instead and add the test set by inserting:

add_test(TabularList.from_df(test_df, path=path, cat_names=cat_names, cont_names=cont_names))

between the label and the databunch line.


(Clemens Adolphs) #3

That sounds like the library itself needs that modification, right? As in, for someone just using the library without changing its source code, there is no way to get a test set added to the TabularDataBunch?


#4

It mostly sounds like you need to learn the data block API :wink:
The basic factory methods are only there to get started quickly. To make any modification to them, you should use the data block API: just copy-paste the source code and modify the line you need.


(Clemens Adolphs) #5

Fair enough; I just would have expected, from a design point of view, that loading your train and test data in one go would be a basic use case necessary to get started quickly, much like the earlier version of fast.ai used to do it.

The data block api seems very nice indeed, much like that grammar-of-graphics plotting approach.


(Thomas) #6

Hi :slight_smile:

I just tried what ‘sgugger’ said about ‘add_test’ but it throws an error.

Added in the fastai library:

My Code:

Error Message:

Can someone tell me what I did wrong!?


(Clemens Adolphs) #7

I get it to work like so:

il = (TabularList.from_csv(path,'train.csv',header='infer',cols=cols, cont_names=cont_names, cat_names=cat_names, procs=procs)
      .random_split_by_pct()
      .label_from_df(cols='Survived')
      .add_test(TabularList.from_csv(path, 'test.csv')))

Note how add_test does not have to specify any of the procs, cat names, cont vars etc again, because they are inferred from the train set. So try your code again but with add_test stripped to the bare minimum of providing the path or data frame.


(Thomas) #8

Hi,

thanks for your reply and help!
Well, I tried passing the bare minimum (a data frame) but
it still does not work :frowning: Same error.

Here it breaks --> self.dl seems to be null/empty


(Abu Fadl) #9

I am working on Titanic and got it to work like so:

        train_df = shuffle(train_df)
        dep_var = 'Survived'
        cat_names = ['Pclass', 'Sex', 'Alone', 'SibCh', 'Embarked'] # 'AgeGroup', 'FareGroup',
        cont_names = ['Age', 'Fare', 'Relatives']
        procs = [FillMissing, Categorify, Normalize]
        test = TabularList.from_df(test_df, cat_names=cat_names, cont_names=cont_names, procs=procs)
        data = (TabularList.from_df(train_df, path='.', cat_names=cat_names, cont_names=cont_names, procs=procs)
                .split_by_idx(valid_idx=range(len(train_df)-175, len(train_df)))
                .label_from_df(cols=dep_var)
                .add_test(test, label=0)
                .databunch())
        emb_szs = {'Pclass': 6}
        learn = tabular_learner(data, layers=[60,40], emb_szs=emb_szs, metrics=accuracy)
        learn.lr_find()
        learn.recorder.plot()
        lr = 1e-1
        learn.fit_one_cycle(4, lr)

Code copied from Colab (Kaggle only works with CPU and is unstable at the moment, for me at least). Some features are made up.

PS: learn.show_results() doesn’t seem to work # https://docs.fast.ai/tutorial.data.html
I am trying to find out how to configure embeddings (best practices) for categories. I saw examples ignoring emb_szs and others using it. What approximate criteria should I use?
Update: found emb_szs hint at https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py and posted Kaggle kernel: https://www.kaggle.com/abedkhooli/fastai-titanic (using cpu, torch install takes too much time).


(nok) #10

Can you share how you called prediction on the test set? Are unseen tokens in the test set handled by the embedding?


(Abu Fadl) #11

See full Kaggle kernel above. Not sure about 2nd part (if you mean categories in test set not in train set). In the Titanic case, this is not an issue but I would guess .add_test takes care of that.
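
On the question of categories that appear only in the test set, one common approach is to reserve an index for "unknown" and map every unseen value to it. The sketch below is plain Python under that assumption; `fit_categories` and `encode` are hypothetical helpers, and this is not a statement of what `add_test` does internally:

```python
def fit_categories(values):
    """Build a value -> index map from training values; index 0 is reserved for unknowns."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


def encode(values, mapping, unknown_index=0):
    """Encode values with the training map; anything unseen falls back to unknown_index."""
    return [mapping.get(v, unknown_index) for v in values]


train_embarked = ["S", "C", "Q", "S"]
test_embarked = ["S", "X", "Q"]  # "X" never appears in the training data

mapping = fit_categories(train_embarked)
codes = encode(test_embarked, mapping)
print(mapping, codes)
```

With this scheme the embedding table needs one more row than the number of training categories, so that the reserved index has its own vector.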


#12

To clarify, by adding the test set using add_test() as per the code below, is the test set normalized according to the same parameters as the train/valid set? Or should it be added after data has been initialised?

Also, on a possibly related note, the following code throws a warning when the highlighted part is added, and I’m not sure how to locate the problematic part.


#13

When you use add_test, the transforms (FillMissing, Normalize and Categorify) always use the state determined on the training set.
You should pass cat_names and cont_names too when you create your TabularList for the test set; this may be why you get the warning.
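
To illustrate what "state determined on the training set" means for normalization, here is a plain-Python sketch. It assumes a simple mean/standard-deviation scheme, and `fit_normalizer` and `apply_normalizer` are hypothetical helpers rather than fastai functions:

```python
import statistics


def fit_normalizer(train_values):
    """Compute the mean and standard deviation from the training column only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def apply_normalizer(values, mean, std, eps=1e-7):
    """Normalize any split with the statistics fitted on the training column."""
    return [(v - mean) / (std + eps) for v in values]


train_fare = [10.0, 20.0, 30.0]
test_fare = [20.0, 50.0]  # made-up values for illustration

mean, std = fit_normalizer(train_fare)
train_norm = apply_normalizer(train_fare, mean, std)
# The test column reuses the training mean and std; nothing is refitted.
test_norm = apply_normalizer(test_fare, mean, std)
print(test_norm)
```

Because the statistics come from the training column alone, a test value far outside the training range simply ends up with a large normalized value.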


(Matthew Arthur) #15

How is the output called and used? From the above approach I believe I have trained a model on columnar data, but I am unsure how to verify or apply the results. I have:

test_df = dftest
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_vars, procs=procs)
.split_by_idx(valid_idx)
.label_from_df(cols=dep_var, label_cls=FloatList, log=True)
.add_test(TabularList.from_df(test_df, path=path, cat_names=cat_names, cont_names=cont_names))
.databunch())

And then I train the learner and try to get preds with:

preds = learn.get_preds(ds_type=DatasetType.Test)

Which gives me:
[tensor([[0.5986],
[0.8231],
[0.7974],
…,

I am confused about the output. Is this the log of the dependent variable? Have I made a mistake elsewhere?


#16

Yes, it’s the log of the dependent variable (since you put log=True), as predicted by your model.
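
Since the predictions are in log space, getting back to the original units means exponentiating them. A small plain-Python sketch, assuming the natural logarithm was used for the label transform; the target values are made up for illustration:

```python
import math

# Example targets in the original units (made-up values).
targets = [1.5, 2.0, 10.0]

# With a log-transformed label, the model is trained on log(target)...
log_targets = [math.log(t) for t in targets]

# ...so its outputs live in log space and must be exponentiated to return
# to the original units.
log_preds = [0.5986, 0.8231, 0.7974]  # first values from the output quoted above
preds_original_units = [math.exp(p) for p in log_preds]

# Round trip: exp undoes log on the targets.
recovered = [math.exp(v) for v in log_targets]
print(preds_original_units)
```

For the tensor returned by get_preds, the equivalent would be an element-wise exponential such as torch.exp applied to the predictions.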


(Matthew Arthur) #17

Of course! Thank you!


(Sergio Ferlito) #18

I suppose that the correct way is something like:

data = (TabularList.from_df(train_df, path=path, cat_names=cat_vars,
cont_names=cont_vars+bools_vars, procs=procs)
.split_by_idx(valid_idx)
.label_from_df(cols=dep_var, label_cls=FloatList, log=True)
.add_test(TabularList.from_df(test_df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs))
.databunch())

But in doing so I get the following exception:

Exception: There are nan values in field 'field_name' but there were none in the training set. 
                Please fix those manually.

How can I fix this?
Thanks.


#19

As indicated by the error message, you have NaNs in your validation or test set in a field where the training set had none (we need to fix the error message so that it shows the field name). You need to fix them manually since the processor state is fixed by the training set: either introduce some NaNs in your training set in this column, or remove the ones in the validation/test set.
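
One way to find and remove such NaNs before building the data object is sketched below in plain Python. The rows are made up, the helper names are hypothetical, and filling with the training median is just one possible choice:

```python
import math
import statistics


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def columns_with_new_missing(train_rows, other_rows):
    """Return columns that have missing values in other_rows but none in train_rows."""
    return [
        col for col in train_rows[0].keys()
        if not any(is_missing(r[col]) for r in train_rows)
        and any(is_missing(r.get(col)) for r in other_rows)
    ]


def fill_with_train_median(train_rows, other_rows, col):
    """Replace missing values in other_rows[col] with the training median."""
    median = statistics.median(r[col] for r in train_rows)
    return [{**r, col: median} if is_missing(r.get(col)) else dict(r) for r in other_rows]


train = [{"Age": 22.0, "Fare": 7.25}, {"Age": 38.0, "Fare": 71.28}, {"Age": 26.0, "Fare": 7.92}]
test = [{"Age": 34.5, "Fare": float("nan")}, {"Age": 47.0, "Fare": 7.0}]

bad_cols = columns_with_new_missing(train, test)
for col in bad_cols:
    test = fill_with_train_median(train, test, col)
print(bad_cols, test)
```

The same check can be run on the validation rows, since the exception applies to any split other than the training set.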


(Sergio Ferlito) #20

How do I add labels to the test dataset?
If I specify:

data = (TabularList.from_df(train_df, path=path, cat_names=cat_vars,
cont_names=cont_vars, procs=procs)
.split_by_idx(valid_idx)
.label_from_df(cols=dep_var, label_cls=FloatList, log=True)
.add_test(TabularList.from_df(test_df, path=path, cat_names=cat_vars,
cont_names=cont_vars, procs=procs)
.label_from_df(cols=dep_var, label_cls=FloatList, log=True))
.databunch())

I get an error using

data.show_batch(4, ds_type=DatasetType.Test)

AttributeError: 'TabularList' object has no attribute 'codes'

What’s wrong?
Thanks again.


#21

In fastai the test set is not labelled. It’s there to quickly get predictions on a lot of inputs. If you want to validate on a second, labelled set, you should create a second data object with that set as its validation set, as documented here.