When creating your DataBunch, your test set (which I think is better called a validation set, for clarity's sake) must have the same header as your training data. When you split the data, you're telling your learner to train on, say, 80% of your data and to validate itself on the remaining, say, 20% (this is what you're doing when you call add_test()).
After you train the model, you can pass it a new dataset that the model has not seen before, similar to what you tried doing … this dataset doesn't need a column for the dependent variable (what you're trying to predict).
This way, you'll get the answer you're looking for.
BOTTOM LINE: the add_test set in the initial DataBunch acts as an internal validation set.
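The split-vs-test distinction can be illustrated without fastai at all. A minimal pandas/NumPy sketch of the same idea (the column names, including isFraud, are just placeholders borrowed from this competition):

```python
import numpy as np
import pandas as pd

# Toy training data: features plus the dependent variable.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "amount": rng.normal(100, 30, 1000),
    "card_type": rng.integers(0, 3, 1000),
    "isFraud": rng.integers(0, 2, 1000),
})

# split_by_rand_pct-style 80/20 split: the held-out 20% is the
# validation set the learner scores itself on during training.
valid_idx = rng.choice(len(train), size=len(train) // 5, replace=False)
valid_df = train.iloc[valid_idx]
train_df = train.drop(train.index[valid_idx])

# A real test set (data the model has never seen) has the same feature
# columns but no dependent variable; you predict on it after training.
test_df = pd.DataFrame({
    "amount": rng.normal(100, 30, 200),
    "card_type": rng.integers(0, 3, 200),
})
```

This is only the splitting logic; in fastai you would express the same thing through the data block API rather than by hand.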
OK, now it makes sense that I need the dependent column, but I find the name kind of misleading. I thought the validation set is created from the training set by one of the split_... methods, and the add_test... method is for adding a real test set, i.e. data the learner has not seen during training.
Ah, my bad. In that case, create a dummy column called isFraud in your test data. If that still isn't working, make sure your cat and cont vars don't include it by accident. Worst case, I can send you my fastai kernel for this competition on Kaggle.
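The dummy column itself is a one-liner in plain pandas (feature names here are made up; isFraud is this competition's target). The zeros are never used for scoring the real test set, they only keep the processing pipeline happy:

```python
import pandas as pd

# A test set with the same feature columns but no label.
test_df = pd.DataFrame({"amount": [12.5, 99.0], "card_type": [0, 2]})

# Dummy dependent variable so the pipeline finds the expected column;
# the values are placeholders and do not affect predictions.
test_df["isFraud"] = 0
```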
No prob ;=)
I had added this isFraud column but was confused by this post:
Because if the learner takes this test set for validation while learning, shouldn't the column contain real labels rather than just 0-values?!
Another issue with this competition: did you write your own Area-Under-the-ROC metric function, or do you use the fastai standard auroc function? If I use the fastai function, I run into CUDA errors on Colab. Did you experience similar behaviour?
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-76-08fc7ea4f26c> in <module>()
----> 1 learner.fit_one_cycle(3, 1e-2, wd=0.2)
/usr/local/lib/python3.6/dist-packages/fastai/metrics.py in roc_curve(input, targ)
292 threshold_idxs = torch.cat((distinct_value_indices, LongTensor([len(targ) - 1]).to(targ.device)))
293 tps = torch.cumsum(targ * 1, dim=-1)[threshold_idxs]
--> 294 fps = (1 + threshold_idxs - tps)
295 if tps[0] != 0 or fps[0] != 0:
296 fps = torch.cat((LongTensor([0]), fps))
RuntimeError: The size of tensor a (9) must match the size of tensor b (2) at non-singleton dimension 1
The docs say the metric is "Restricted to binary classification tasks", so I think this is what causes the tensor-size mismatch error. But this project should be a binary classifier (either isFraud or !isFraud)…
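The mismatch in the traceback suggests the metric is receiving the (N, 2) two-class output where it expects a 1-D tensor. One workaround (not fastai's built-in metric) is to compute AUROC yourself from the positive-class column. A NumPy sketch using the rank-based (Mann-Whitney) formulation, with function and variable names of my own invention, ignoring tied scores for simplicity:

```python
import numpy as np

def auroc(scores, targets):
    """Rank-based AUROC for binary targets: U / (n_pos * n_neg).

    scores:  (N,) predicted probabilities for the positive class
    targets: (N,) array of 0/1 labels
    (Ties in `scores` are not averaged here, for brevity.)
    """
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = targets == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # Sum of positive ranks minus the minimum possible rank sum, normalized.
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# The key step: a (N, 2) softmax output must be flattened to the
# positive-class probability before any binary-AUROC computation.
probs_2col = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
targ = np.array([0, 1, 0, 1])
score = auroc(probs_2col[:, 1], targ)  # perfectly ranked here, so 1.0
```

The same column-selection step applies if you wrap sklearn's roc_auc_score as a custom metric instead of implementing the ranking yourself.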
@ulat @muellerzr can you share the AUROC test LB score you were able to get with the fastai tabular model on this Kaggle dataset? I'm using it to practice tabular modelling and wondering how good is good enough - my starter model without any feature engineering scored 0.8789 on the public LB.