Data leakage in Fastai Tabular?

DanyWin · January 12, 2020, 11:43am

Hello everyone !

I am working on tabular data, to predict whether or not a particular person has an account of a specific type.

I have run a Random Forest which worked fine with 0.85 accuracy, and I wanted to try Fastai Tabular afterwards.

However, I ran into a problem along the way: I got a perfect accuracy of 1.

I later ran some tests, and I found an interesting behavior:
When I fetch an observation and pass it to learn.predict, it needs to contain the dependent variable, otherwise I get a KeyError.

Fine, but I want to predict the dependent variable when I don’t have it, so I filled it with np.nan. I assumed it would not be used during prediction and that the data bunch would simply discard it, but it actually changed the predictions!

Even more interesting, when I change this field to 1, the prediction becomes 1 with almost 0.99 probability, and when I change it to 0, it becomes 0 with the same 0.99 probability.

I don’t know if the learn.predict is supposed to be used only for validating observations where we know the label, but I find it troubling, because this seems to indicate that the label of interest is used for prediction !
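
The test described above can be written as a small, library-independent check. This is only a sketch: `predict_fn` is a hypothetical stand-in for whatever wraps learn.predict, and the row, column names, and the two toy models are made up for illustration.

```python
def prediction_depends_on(predict_fn, row, column, values):
    """Return True if changing `column` in `row` changes the output of predict_fn."""
    outputs = []
    for value in values:
        modified = dict(row)      # copy, so the original row is left untouched
        modified[column] = value
        outputs.append(predict_fn(modified))
    return any(out != outputs[0] for out in outputs[1:])


# Hypothetical observation; "has_account" plays the role of the dependent variable.
row = {"age": 42, "income": 30000.0, "has_account": 0}


def leaky_predict(r):
    # Toy model that (wrongly) reads the label as an input feature.
    return int(r["has_account"])


def clean_predict(r):
    # Toy model that only looks at a genuine feature.
    return int(r["age"] > 40)


leaky = prediction_depends_on(leaky_predict, row, "has_account", [0, 1])
clean = prediction_depends_on(clean_predict, row, "has_account", [0, 1])
print("leaky model depends on label:", leaky)
print("clean model depends on label:", clean)
```

If the result is True for the dependent variable, the label is reaching the model as an input somewhere in the pipeline.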

I haven’t checked under the hood, but I am pretty sure I specified the correct dep_var when initializing my Tabular Learner, so I think there is data leakage somewhere.

Can someone help me with this issue?

sgugger · January 12, 2020, 1:17pm

No one can help without seeing the actual code. It’s very likely there is something in there that introduces the data leak.

DanyWin · January 12, 2020, 1:53pm

Found the answer, thanks !

Since this was a classification problem, I had put the target label among the cont_var, so it was being used as an input feature ^^
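
For anyone hitting the same thing, a guard along these lines would have caught it before training. It is a minimal sketch in plain Python, not part of fastai: the helper names and the example column names are hypothetical, and the resulting lists are meant to be passed on as the categorical and continuous feature names.

```python
def split_features(columns, dep_var, cat_names):
    """Build feature lists that exclude the dependent variable.

    Every column that is neither the target nor categorical is treated
    as continuous (an assumption made for this sketch).
    """
    cats = [c for c in cat_names if c != dep_var]
    conts = [c for c in columns if c != dep_var and c not in cats]
    return cats, conts


def assert_no_target_leak(dep_var, cat_names, cont_names):
    """Raise ValueError if the dependent variable is listed as an input feature."""
    if dep_var in cat_names or dep_var in cont_names:
        raise ValueError(f"dependent variable {dep_var!r} is listed as an input feature")


# Hypothetical column names for illustration.
columns = ["age", "income", "city", "has_account"]
dep_var = "has_account"

cat_names, cont_names = split_features(columns, dep_var, cat_names=["city"])
assert_no_target_leak(dep_var, cat_names, cont_names)  # passes silently

# The mistake from this thread: the label ends up among the continuous variables.
try:
    assert_no_target_leak(dep_var, ["city"], ["age", "income", "has_account"])
    leak_detected = False
except ValueError:
    leak_detected = True
```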