Tabular Data - Problem with test set

I am creating a TabularList for a TabularLearner this way:

procs = [FillMissing, Categorify, Normalize]
test = TabularList.from_df(test_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars,)

data = (TabularList.from_df(df=train_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                              #.split_by_rand_pct(valid_pct=0.2)
                              .split_by_idx(list(range(0,10000)))
                              .label_from_df(cols=dep_var, label_cls=None)
                              .add_test(test, label=0)
                              .databunch())

But I get a KeyError. It looks for the dependent column in the test set, which isn't there…

If I remove the .add_test(...) line, everything works fine?! How can I add the test data?

When I try to predict the y values for the test set later on, I get a different number of values:

predictions, *_ = learner.get_preds(test)
labels = np.argmax(predictions, 1)
len(test_df_small), len(labels)

(50669, 47244)

I have worked through many threads in the forum, but I keep getting stuck…

When creating your databunch, your test set (which I think is better called a validation set, for clarity's sake) must have the same header as your train data. When you split the data, you're telling your learner to train on, say, 80% of your data and to verify itself on, say, the other 20% (this is what you're doing when you call add_test()).

After you train the model, you can then pass a new dataset that the model has not seen before, similar to what you tried doing… that dataset doesn't need to have a column for the dependent variable (what you're trying to predict).

This way, you’ll get the answer you’re looking for

BOTTOM LINE: the add_test set in the initial databunch is an internal validation set.
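
For example, here is a rough sketch of scoring genuinely new data after training, one row at a time (just an illustration, assuming your trained learner is called learner and your unlabeled frame is test_df_small):

# Rough sketch, assuming `learner` has already been trained on the databunch above.
# `predict` takes a single row (a pandas Series) and returns the predicted category,
# its index tensor, and the class probabilities; no dependent column is needed here.
row = test_df_small.iloc[0]
pred_class, pred_idx, probs = learner.predict(row)
print(pred_class, probs)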


Ok, now it makes sense that I need the dependent column, but I find the name kind of misleading. I thought the validation set is created from the train set by using one of the split_... methods, and that the add_test... method is for adding a real test set with data the learner has not seen during training.


Good luck to you…

Is your test data unlabeled or labeled?

Seems to be a bit tricky.

Now I removed the add_test() method from data:

test = (TabularList.from_df(test_df_small, path=BASE_PATH/'model')) 
data = (TabularList.from_df(df=train_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                              .split_by_rand_pct(valid_pct=0.2)
                              #.split_by_idx(list(range(0,10000)))
                              .label_from_df(cols=dep_var)
                              #.add_test(test)
                              .databunch())

Afterwards I create the predictions for the test data:

predictions = learner.get_preds(test)

But it seems that my learner doesn't create predictions for each entry in the test set:

len(test_df_small), len(predictions[0])

Output:

(50669, 47244)

Shouldn’t they have the same length?

The test data is unlabeled.

In that case, what you’ll want to do is something like this:

data = (TabularList.from_df(df=train_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                              .split_by_rand_pct(valid_pct=0.2)
                              .label_from_df(cols=dep_var)
                              .add_test(TabularList.from_df(test_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars))
                              .databunch())

And to grab predictions do

preds = learn.get_preds(DatasetType.Test)
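
If it helps, here is a rough follow-up sketch of turning those predictions into per-row outputs (just an illustration; the isFraud_pred column name and the assumption that the positive class sits at index 1 are mine):

from fastai.basic_data import DatasetType  # also available via `from fastai.tabular import *`

# get_preds returns a (n_test, n_classes) tensor of probabilities plus the
# dummy labels fastai attached to the test set; row order follows test_df_small.
preds, _ = learn.get_preds(ds_type=DatasetType.Test)
test_df_small['isFraud_pred'] = preds[:, 1].numpy()            # probability of the positive class
test_df_small['isFraud_label'] = preds.argmax(dim=1).numpy()   # hard 0/1 prediction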

I am sorry, but I had exactly this in my first attempt (except for creating the test TabularList outside of the .add_test function).

If I run this code snippet I get the KeyError (where isFraud is the dependent column that is only present in the training data):

KeyError: 'isFraud'

Full Stack trace:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'isFraud'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
7 frames
<ipython-input-17-fd664806ff07> in <module>()
      2                               .split_by_rand_pct(valid_pct=0.2)
      3                               .label_from_df(cols=dep_var)
----> 4                               .add_test(TabularList.from_df(test_df_small, path=BASE_PATH/'model', cat_names=cat_vars, cont_names=cont_vars))
      5                               .databunch())

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in add_test(self, items, label, tfms, tfm_y)
    558         else: labels = self.valid.y.new([label] * len(items)).process()
    559         if isinstance(items, MixedItemList): items = self.valid.x.new(items.item_lists, inner_df=items.inner_df).process()
--> 560         elif isinstance(items, ItemList): items = self.valid.x.new(items.items, inner_df=items.inner_df).process()
    561         else: items = self.valid.x.new(items).process()
    562         self.test = self.valid.new(items, labels, tfms=tfms, tfm_y=tfm_y)

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in process(self, processor)
     81         if processor is not None: self.processor = processor
     82         self.processor = listify(self.processor)
---> 83         for p in self.processor: p.process(self)
     84         return self
     85 

/usr/local/lib/python3.6/dist-packages/fastai/tabular/data.py in process(self, ds)
     62             return
     63         for i,proc in enumerate(self.procs):
---> 64             if isinstance(proc, TabularProc): proc(ds.inner_df, test=True)
     65             else:
     66                 #cat and cont names may have been changed by transform (like Fill_NA)

/usr/local/lib/python3.6/dist-packages/fastai/tabular/transform.py in __call__(self, df, test)
    122         "Apply the correct function to `df` depending on `test`."
    123         func = self.apply_test if test else self.apply_train
--> 124         func(df)
    125 
    126     def apply_train(self, df:DataFrame):

/usr/local/lib/python3.6/dist-packages/fastai/tabular/transform.py in apply_test(self, df)
    175                     if name+'_na' not in self.cat_names: self.cat_names.append(name+'_na')
    176                 df[name] = df[name].fillna(self.na_dict[name])
--> 177             elif pd.isnull(df[name]).sum() != 0:
    178                 raise Exception(f"""There are nan values in field {name} but there were none in the training set. 
    179                 Please fix those manually.""")

/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   2925             if self.columns.nlevels > 1:
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):
   2929                 indexer = [indexer]

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2657                 return self._engine.get_loc(key)
   2658             except KeyError:
-> 2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2660         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2661         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'isFraud'

Ah, my bad :slight_smile: In that case, create a dummy column in your test data called isFraud. If that still isn't working, make sure that your cat and cont vars don't have it in there by accident. Worst case, I can send you my fastai kernel for this competition on Kaggle :wink:
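
Something along these lines (a tiny sketch, using your column names):

# Add a dummy dependent column so the test frame has the same header as the
# training frame, and double-check that the dependent variable never sneaks
# into the feature lists by accident.
test_df_small['isFraud'] = 0
assert 'isFraud' not in cat_vars and 'isFraud' not in cont_vars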

No prob ;=)
I had added this isFraud column but was confused by this post:

because if the learner uses this test set for validation while learning, there shouldn't be a column with just 0-values, should there?!

Another issue with this competition: did you write your own area-under-the-ROC metric function, or do you use the fastai standard ROC function? When I use the fastai function I run into CUDA errors on Colab. Did you experience similar behaviour?

 learner = tabular_learner(data, layers=[2000,3000, 1000], ps=[0.001,0.01, 0.01], emb_drop=0.04, metrics=roc_curve, callback_fns=ShowGraph)

--> Training starts but is interrupted:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-76-08fc7ea4f26c> in <module>()
----> 1 learner.fit_one_cycle(3, 1e-2, wd=0.2)
8 frames
/usr/local/lib/python3.6/dist-packages/fastai/metrics.py in roc_curve(input, targ)
292     threshold_idxs = torch.cat((distinct_value_indices, LongTensor([len(targ) - 1]).to(targ.device)))
293     tps = torch.cumsum(targ * 1, dim=-1)[threshold_idxs]
--> 294     fps = (1 + threshold_idxs - tps)
295     if tps[0] != 0 or fps[0] != 0:
296         fps = torch.cat((LongTensor([0]), fps))

RuntimeError: The size of tensor a (9) must match the size of tensor b (2) at non-singleton dimension 1

The docs say: "Restricted binary classification tasks." So I think this causes the tensor-size mismatch error. But this project should be a binary classifier (either isFraud or not isFraud)…

Now I have managed to get rid of the tensor-size error by using AUROC() as a metric (instead of auc_roc_score):

learner = tabular_learner(data, layers=[3000,1000, 20], ps=[0.001,0.001, 0.01], y_range=y_range, emb_drop=0.1, metrics=[accuracy, AUROC()], callback_fns=ShowGraph)
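
As a sanity check you could also compute the score outside of fastai; here is a rough sketch using scikit-learn on the validation predictions (assuming a binary problem with the positive class at index 1):

from sklearn.metrics import roc_auc_score

# Cross-check the AUROC reported by fastai using the validation set predictions.
val_preds, val_targets = learner.get_preds(ds_type=DatasetType.Valid)
print(roc_auc_score(val_targets.numpy(), val_preds[:, 1].numpy()))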

But when doing the prediction on the test set I get a new error…
See my notebook on github: https://github.com/we-make-ai/ieee-cis-fraud-detection

@muellerzr By the way, would you mind sharing your notebook?

Hi! I managed to run a full training on my local machine. Seems that something weird goes on with Colab…

You can check out my kernel on GitHub. Well, no feature engineering, just a bit of tweaking of layer sizes and embedding dropout.

Do you have any suggestions for how I could tune the tabular learner further?

@ulat @muellerzr can you share the AUROC test LB score you were able to get with the fastai tabular model on this Kaggle dataset? I'm using it to practice tabular models and wondering how good is good enough: my starter model without any feature engineering scored 0.8789 on the public LB.

Hi!
Here I did some tests on the training data: https://docs.google.com/spreadsheets/d/1_WeQN0zkSMVjlRclZLwwm2zC5usUhriNZGCLysyZ86Q/edit?usp=sharing
The max score I could get on the public LB is 0.8895.


Would you mind sharing your kernel?