Tabular Data - Problem with test set

I am creating a TabularList for a TabularLearner this way:

procs = [FillMissing, Categorify, Normalize]
test = TabularList.from_df(test_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars,)

data = (TabularList.from_df(df=train_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                              #.split_by_rand_pct(valid_pct=0.2)
                              .split_by_idx(list(range(0,10000)))
                              .label_from_df(cols=dep_var, label_cls=None)
                              .add_test(test, label=0)
                              .databunch())

But I get a KeyError. It looks for the dependent column in the test set, which isn't there…

If I remove the .add_test(...) line, everything works fine?! How can I add the test data?

When I try to predict the y values for the test set later on, I get a different number of values:

predictions, *_ = learner.get_preds(test)
labels = np.argmax(predictions, 1)
len(test_df_small), len(labels)

(50669, 47244)

I have worked through many threads in the forum, but I keep getting stuck…

When creating your databunch, your test set (which I think is better called a validation set, for clarity's sake) must have the same header as your train data. When you split the data, you're telling your learner to train on, say, 80% of your data and to verify itself on, say, the other 20% (this is what you're doing when you call add_test()).

After you train the model, you can then pass a new dataset that the model has not seen before, similar to what you tried doing… that dataset doesn't need to have a column for the dependent variable (what you're trying to predict).

This way, you’ll get the answer you’re looking for

BOTTOM LINE: the add_test set in the initial databunch is an internal validation set.
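
For example, here is a rough sketch of scoring genuinely new data after training, one row at a time (just an illustration, assuming your trained learner is called learner and your unlabeled frame is test_df_small):

# Rough sketch, assuming `learner` has already been trained on the databunch above.
# `predict` takes a single row (a pandas Series) and returns the predicted category,
# its index tensor, and the class probabilities; no dependent column is needed here.
row = test_df_small.iloc[0]
pred_class, pred_idx, probs = learner.predict(row)
print(pred_class, probs)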


Ok, now it makes sense that I need the dependent column, but I find the name kind of misleading. I thought the validation set is created from the train set by using one of the split_... methods, and that the add_test... method is for adding a real test set with data the learner has not seen during training.


Good luck to you…

Is your test data unlabeled or labeled?

Seems to be a bit tricky.

Now I removed the add_test() method from data:

test = (TabularList.from_df(test_df_small, path=BASE_PATH/'model')) 
data = (TabularList.from_df(df=train_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                              .split_by_rand_pct(valid_pct=0.2)
                              #.split_by_idx(list(range(0,10000)))
                              .label_from_df(cols=dep_var)
                              #.add_test(test)
                              .databunch())

Afterwards I create the predictions for the test data:

predictions = learner.get_preds(test)

But it seems that my learner doesn't create predictions for each entry in the test set:

len(test_df_small), len(predictions[0])

Output:

(50669, 47244)

Shouldn’t they have the same length?

The test data is unlabeled.

In that case, what you’ll want to do is something like this:

data = (TabularList.from_df(df=train_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                              .split_by_rand_pct(valid_pct=0.2)
                              .label_from_df(cols=dep_var)
                              .add_test(TabularList.from_df(test_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars))
                              .databunch())

And to grab predictions do

preds = learn.get_preds(DatasetType.Test)
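
If it helps, here is a rough follow-up sketch of turning those predictions into per-row outputs (just an illustration; the isFraud_pred column name and the assumption that the positive class sits at index 1 are mine):

from fastai.basic_data import DatasetType  # also available via `from fastai.tabular import *`

# get_preds returns a (n_test, n_classes) tensor of probabilities plus the
# dummy labels fastai attached to the test set; row order follows test_df_small.
preds, _ = learn.get_preds(ds_type=DatasetType.Test)
test_df_small['isFraud_pred'] = preds[:, 1].numpy()            # probability of the positive class
test_df_small['isFraud_label'] = preds.argmax(dim=1).numpy()   # hard 0/1 prediction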

I am sorry, but I had exactly this in my first attempt (except for creating the test TabularList outside of the .add_test function).

If I run this code snippet I get the KeyError (where isFraud is the dependent column that is only present in the training data):

KeyError: 'isFraud'

Full Stack trace:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'isFraud'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
7 frames
<ipython-input-17-fd664806ff07> in <module>()
      2                               .split_by_rand_pct(valid_pct=0.2)
      3                               .label_from_df(cols=dep_var)
----> 4                               .add_test(TabularList.from_df(test_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars))
      5                               .databunch())

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in add_test(self, items, label, tfms, tfm_y)
    558         else: labels = self.valid.y.new([label] * len(items)).process()
    559         if isinstance(items, MixedItemList): items = self.valid.x.new(items.item_lists, inner_df=items.inner_df).process()
--> 560         elif isinstance(items, ItemList): items = self.valid.x.new(items.items, inner_df=items.inner_df).process()
    561         else: items = self.valid.x.new(items).process()
    562         self.test = self.valid.new(items, labels, tfms=tfms, tfm_y=tfm_y)

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in process(self, processor)
     81         if processor is not None: self.processor = processor
     82         self.processor = listify(self.processor)
---> 83         for p in self.processor: p.process(self)
     84         return self
     85 

/usr/local/lib/python3.6/dist-packages/fastai/tabular/data.py in process(self, ds)
     62             return
     63         for i,proc in enumerate(self.procs):
---> 64             if isinstance(proc, TabularProc): proc(ds.inner_df, test=True)
     65             else:
     66                 #cat and cont names may have been changed by transform (like Fill_NA)

/usr/local/lib/python3.6/dist-packages/fastai/tabular/transform.py in __call__(self, df, test)
    122         "Apply the correct function to `df` depending on `test`."
    123         func = self.apply_test if test else self.apply_train
--> 124         func(df)
    125 
    126     def apply_train(self, df:DataFrame):

/usr/local/lib/python3.6/dist-packages/fastai/tabular/transform.py in apply_test(self, df)
    175                     if name+'_na' not in self.cat_names: self.cat_names.append(name+'_na')
    176                 df[name] = df[name].fillna(self.na_dict[name])
--> 177             elif pd.isnull(df[name]).sum() != 0:
    178                 raise Exception(f"""There are nan values in field {name} but there were none in the training set. 
    179                 Please fix those manually.""")

/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   2925             if self.columns.nlevels > 1:
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):
   2929                 indexer = [indexer]

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2657                 return self._engine.get_loc(key)
   2658             except KeyError:
-> 2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2660         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2661         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'isFraud'

Ah, my bad :slight_smile: In that case, create a dummy column in your test data called isFraud. If that still isn't working, make sure that your cat and cont vars don't have it in there by accident. Worst case, I can send you my fastai kernel for this competition on Kaggle :wink:
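
Something along these lines (a tiny sketch, using your column names):

# Add a dummy dependent column so the test frame has the same header as the
# training frame, and double-check that the dependent variable never sneaks
# into the feature lists by accident.
test_df_small['isFraud'] = 0
assert 'isFraud' not in cat_vars and 'isFraud' not in cont_vars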

No prob ;=)
I had added this isFraud column but was confused by this post:

because if the learner uses this test set for validation while learning, there shouldn't be a column with just 0-values, should there?!

Another issue with this competition: did you write your own area-under-the-ROC metric function, or do you use the fastai standard ROC function? When I use the fastai function I run into CUDA errors on Colab. Did you experience similar behaviour?

 learner = tabular_learner(data, layers=[2000,3000, 1000], ps=[0.001,0.01, 0.01], emb_drop=0.04, metrics=roc_curve, callback_fns=ShowGraph)

--> Training starts but is interrupted:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-76-08fc7ea4f26c> in <module>()
----> 1 learner.fit_one_cycle(3, 1e-2, wd=0.2)
8 frames
/usr/local/lib/python3.6/dist-packages/fastai/metrics.py in roc_curve(input, targ)
292     threshold_idxs = torch.cat((distinct_value_indices, LongTensor([len(targ) - 1]).to(targ.device)))
293     tps = torch.cumsum(targ * 1, dim=-1)[threshold_idxs]
--> 294     fps = (1 + threshold_idxs - tps)
295     if tps[0] != 0 or fps[0] != 0:
296         fps = torch.cat((LongTensor([0]), fps))

RuntimeError: The size of tensor a (9) must match the size of tensor b (2) at non-singleton dimension 1

The docs say: "Restricted binary classification tasks." So I think this causes the tensor-size mismatch error. But this project should be a binary classifier (either isFraud or not isFraud)…

Now I have managed to get rid of the tensor-size error by using AUROC() as a metric (instead of auc_roc_score):

learner = tabular_learner(data, layers=[3000,1000, 20], ps=[0.001,0.001, 0.01], y_range=y_range, emb_drop=0.1, metrics=[accuracy, AUROC()], callback_fns=ShowGraph)
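
As a sanity check you could also compute the score outside of fastai; here is a rough sketch using scikit-learn on the validation predictions (assuming a binary problem with the positive class at index 1):

from sklearn.metrics import roc_auc_score

# Cross-check the AUROC reported by fastai using the validation set predictions.
val_preds, val_targets = learner.get_preds(ds_type=DatasetType.Valid)
print(roc_auc_score(val_targets.numpy(), val_preds[:, 1].numpy()))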

But when doing the prediction on the test set I get a new error…
See my notebook on github: https://github.com/we-make-ai/ieee-cis-fraud-detection

@muellerzr By the way, would you mind sharing your notebook?

Hi! I managed to run a full training on my local machine. Seems that something weird goes on with Colab…

You can check out my kernel on GitHub. Well, no feature engineering, just a bit of tweaking of layer sizes and embedding dropout.

Do you have any suggestions for how I could tune the tabular learner further?

@ulat @muellerzr can you share the AUROC test LB score you were able to get with the fastai tabular model on this Kaggle dataset? I'm using it to practice tabular models and wondering how good is good enough: my starter model without any feature engineering scored 0.8789 on the public LB.

Hi!
Here I did some tests on the training data: https://docs.google.com/spreadsheets/d/1_WeQN0zkSMVjlRclZLwwm2zC5usUhriNZGCLysyZ86Q/edit?usp=sharing
The max score I could get on the public LB is 0.8895.


Would you mind sharing your kernel?