How to pick the validation set?

Hi Guys,

I’m working on a data set that uses log-in and transaction information from an eCommerce website to predict log-in risk (binary classification, 0/1).

The training data are transactions from 2015-01-01 to 2015-06-30; the test data are from 2015-07-01 to 2015-07-31.
(competition link if you are interested:

This is a Kaggle-like competition, so I can see my score (fbeta with beta=0.1) on the public leaderboard.
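For reference, this metric can be computed with scikit-learn’s `fbeta_score`; with beta=0.1 it weights precision far more heavily than recall. A minimal sketch with made-up labels:

```python
# fbeta with beta=0.1: precision counts ~100x more than recall.
from sklearn.metrics import fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

score = fbeta_score(y_true, y_pred, beta=0.1)
print(score)  # → 0.75 (precision and recall are both 0.75 here)
```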

To estimate my model’s performance, I used two approaches, which give very different results:

  1. 5-fold cross validation on entire training set
  2. Out-of-time (OOT) validation using data from 2015-06-01 to 2015-06-30

For both approaches, I trained my submission model on the full training set.
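The two validation schemes can be sketched like this. This is only an illustration on synthetic data; the feature names (`f1`, `f2`) and columns are made up, and the real metric is plugged in via `make_scorer`:

```python
# Sketch: shuffled 5-fold CV vs. out-of-time (train Jan-May, validate June).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "date": pd.to_datetime("2015-01-01")
            + pd.to_timedelta(rng.integers(0, 181, n), unit="D"),
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "y": rng.integers(0, 2, n),
})
X, y = df[["f1", "f2"]], df["y"]
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 1) 5-fold CV on the whole training window (rows shuffled across time)
cv_scores = cross_val_score(clf, X, y,
                            cv=KFold(5, shuffle=True, random_state=0),
                            scoring=make_scorer(fbeta_score, beta=0.1))

# 2) Out-of-time validation: fit on Jan-May, score on June
is_june = df["date"] >= "2015-06-01"
clf.fit(X[~is_june], y[~is_june])
oot = fbeta_score(y[is_june], clf.predict(X[is_june]), beta=0.1)
print(cv_scores.mean(), oot)
```

The key difference is that the shuffled folds mix months together, while the OOT split keeps validation strictly later than training.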

I trained a Random Forest with 500 trees; here are the scores I got:

  1. avg. 5-fold cross-validation score: 0.87
  2. OOT (June data) validation score: 0.39
  3. test set (public leaderboard) score: 0.66

After I saw the huge difference between 1 and 2, my first guess was that data leakage had happened somewhere. While I cannot rule out the possibility of data leakage, I checked the distribution of risky transactions over time:

I saw that the rate of risky transactions is much lower in June than in the months before. I believe this explains why the score on the OOT set is so different from the cross-validation and test scores.
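The distribution check described above can be done with a simple groupby. This is a sketch with assumed column names (`date`, `is_risk`) and toy data:

```python
# Fraction of risky transactions per calendar month.
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-15", "2015-02-10",
                            "2015-06-05", "2015-06-20"]),
    "is_risk": [1, 0, 0, 0],
})
monthly_rate = df.groupby(df["date"].dt.to_period("M"))["is_risk"].mean()
print(monthly_rate)  # one risk rate per month with data
```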

My question is:

How can I tell whether the difference between the OOT, CV, and test scores is due to the shifted distribution of the target variable or to something else?

How can I choose a validation set that I can rely on in this case?



Good analysis! BTW I’m not sure you’re using the term ‘data leakage’ quite correctly - see here for more info. Is that what you’re referring to, or something else?

You need to figure out how the test set is structured. Try making ‘is_test’ a dependent variable, like we did in the ‘Extrapolation’ section in class. Hopefully you can use that info to figure out how it is different.
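The ‘is_test’ idea is sometimes called adversarial validation: label train rows 0 and test rows 1, fit a classifier, and inspect the AUC and feature importances. An AUC near 0.5 means train and test look alike; a high AUC points at the features that differ. A minimal sketch on synthetic data (feature names are made up, with `f1` deliberately drifted):

```python
# Adversarial validation: can a model tell train rows from test rows?
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
train = pd.DataFrame({"f1": rng.normal(0, 1, 500), "f2": rng.normal(0, 1, 500)})
test = pd.DataFrame({"f1": rng.normal(2, 1, 500), "f2": rng.normal(0, 1, 500)})

both = pd.concat([train, test], ignore_index=True)
is_test = np.r_[np.zeros(len(train)), np.ones(len(test))]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, both, is_test, cv=5, scoring="roc_auc").mean()
clf.fit(both, is_test)
print(auc)  # well above 0.5 here, because f1 drifted
print(dict(zip(both.columns, clf.feature_importances_)))  # f1 dominates
```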


Thank you for your suggestion.

For your first point, the content of the link you provided is exactly what I meant by ‘data leakage’. I suspect it because the cross-validation score is much higher than the OOT validation score. I wonder whether some time-related feature caused the leakage, since during cross-validation it is possible, say, to train on transactions from June and validate on transactions from May.
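That failure mode (validating on May while June is in the training fold) can be ruled out by replacing shuffled k-fold with time-ordered splits, e.g. scikit-learn’s `TimeSeriesSplit`. A minimal sketch, assuming rows are sorted by transaction time:

```python
# Time-ordered CV: the validation fold is always later than the training fold.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # stand-in features, sorted by time
splits = list(TimeSeriesSplit(n_splits=4).split(X))
for train_idx, val_idx in splits:
    assert train_idx.max() < val_idx.min()  # train always precedes validation
```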

I do have some time-related features; I calculated:

  1. Count of logins/transactions from the same user prior to the current transaction
  2. Length of time from the first login/transaction to current transaction
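For concreteness, those two features can be computed with a groupby. This sketch assumes hypothetical column names (`user_id`, `ts`):

```python
# 1) count of prior events by the same user, 2) time since the user's first event.
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "ts": pd.to_datetime(["2015-01-01", "2015-01-03", "2015-02-01",
                          "2015-01-10", "2015-01-11"]),
}).sort_values(["user_id", "ts"])

df["prior_count"] = df.groupby("user_id").cumcount()
df["days_since_first"] = (
    df["ts"] - df.groupby("user_id")["ts"].transform("min")
).dt.days
print(df)
```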

However, I cannot see how data leakage could happen with these features. Do you think data leakage is possible here?

For your second point, we already know the train/test split is time-based (train: January to June; test: July). Do I still need to do the ‘is_test’ prediction?

Thanks again for your reply!