Best cross-validation strategy for churn prediction

etremblay · May 21, 2021, 1:09pm

This is not a fastai-specific question, but the community here is so helpful that I thought I would post it anyway.

I am looking for the best cross-validation strategy to test the performance of a churn prediction model (classification).

The model predicts whether a client is going to churn in the next 6 months. The dependent variable is simply true/false: did the customer churn in the next 6 months? Each month we have usage data for each customer, so it is basically a time series dataset per client. The data is highly imbalanced, as a lot of churn datasets are: the churn event is pretty rare.

I tried two cross-validation strategies. The first is StratifiedGroupKFold, where no client appears in more than one fold and each fold has roughly the same class distribution of churn vs non-churn. This ensures that the model is tested on clients it has never seen in the training data, preventing data leakage. But the test folds cover the same time period as the training folds, past and future alike, which is not ideal for time-based data… This cross-validation strategy doesn’t take time into account.
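To make the trade-off concrete, here is a minimal pure-Python sketch of a group-aware, stratified fold assignment. It is not scikit-learn's StratifiedGroupKFold implementation, only an illustration of the same idea. The function name `stratified_group_folds`, the `(client_id, month, churned)` row layout, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict

def stratified_group_folds(rows, n_folds):
    """Map each client to one fold, spreading churners evenly (hypothetical helper).

    rows: list of (client_id, month, churned) tuples.
    """
    # A client counts as a churner if any of its rows is labelled True.
    client_label = defaultdict(bool)
    for client_id, _month, churned in rows:
        client_label[client_id] = client_label[client_id] or churned

    fold_of = {}
    # Deal churners, then non-churners, round-robin across the folds.
    for label in (True, False):
        clients = sorted(c for c, l in client_label.items() if l == label)
        for i, client_id in enumerate(clients):
            fold_of[client_id] = i % n_folds
    return fold_of

# Toy data (assumed): 6 clients, 4 months each, clients c0 and c1 churn late.
rows = [(f"c{k}", m, k < 2 and m >= 3) for k in range(6) for m in range(1, 5)]
fold_of = stratified_group_folds(rows, n_folds=2)

# Use fold 0 as the test fold and everything else as training data.
train = [r for r in rows if fold_of[r[0]] != 0]
test = [r for r in rows if fold_of[r[0]] == 0]

train_clients = {r[0] for r in train}
test_clients = {r[0] for r in test}
train_months = {r[1] for r in train}
test_months = {r[1] for r in test}
```

In this toy split no client appears on both sides, yet the training and test rows cover exactly the same months, so the model is never asked to predict a period it has not already seen.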

For the second cross-validation strategy I simply split by time and try to predict observations in the future.
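A time-based split could look roughly like the sketch below. The `gap` argument is an assumption that goes beyond what is described above: because the label looks 6 months ahead, training rows from the last 6 months before the cutoff have labels that are only fully known after the cutoff, so one option is to drop them. The function name `time_split` and the toy rows are hypothetical.

```python
def time_split(rows, cutoff_month, gap=0):
    """Split rows into past (train) and future (test) by month (hypothetical helper).

    rows: list of (client_id, month) tuples.
    gap:  number of months before the cutoff to leave out of training,
          e.g. the label horizon, so that no training label needs data
          from after the cutoff.
    """
    train = [r for r in rows if r[1] <= cutoff_month - gap]
    test = [r for r in rows if r[1] > cutoff_month]
    return train, test

# Toy data (assumed): 3 clients observed for 12 months.
rows = [(f"c{k}", m) for k in range(3) for m in range(1, 13)]

# Train on months 1-3, test on months 10-12, with a 6-month gap in between.
train, test = time_split(rows, cutoff_month=9, gap=6)
last_train_month = max(r[1] for r in train)
first_test_month = min(r[1] for r in test)
```

The gap costs training rows, which hurts when data is scarce, so whether to use it is a judgment call. scikit-learn's `TimeSeriesSplit` exposes a similar `gap` parameter for row-ordered data.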

The model gets a good AUC score when using StratifiedGroupKFold, but much worse when splitting by time.

Not too sure how to interpret that. Does it mean the model is not good at extrapolating into the future, but can predict well on clients it has never seen before?

Thanks!

maxwelll · May 22, 2021, 6:56am

For time series, what you want to know is whether your model is good at predicting the future, not the past (which is what happened when you used the stratified method). Of course your model is better at predicting the past!

If I were you, I’d split the dataset in two chronologically (80/20), then split the earlier part into k train/validation folds using the stratification method.
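One possible reading of this suggestion, sketched in pure Python: hold out the most recent 20% of months as a final test set, then assign the earlier rows to k stratified folds for model selection. The helper names, the `(client_id, month, churned)` row layout, and the toy data are assumptions, and the inner folds here stratify by label only (they do not group by client).

```python
def chronological_holdout(rows, holdout_fraction=0.2):
    """Reserve the most recent months as a holdout set (hypothetical helper)."""
    months = sorted({month for _cid, month, _y in rows})
    n_holdout = max(1, round(len(months) * holdout_fraction))
    holdout_months = set(months[-n_holdout:])
    dev = [r for r in rows if r[1] not in holdout_months]
    holdout = [r for r in rows if r[1] in holdout_months]
    return dev, holdout

def stratified_fold_ids(labels, n_folds):
    """Give each row a fold id, dealing each class round-robin across folds."""
    seen = {}
    fold_ids = []
    for y in labels:
        fold_ids.append(seen.get(y, 0) % n_folds)
        seen[y] = seen.get(y, 0) + 1
    return fold_ids

# Toy data (assumed): 4 clients, 10 months, a handful of positive labels.
rows = [(f"c{k}", m, (k + m) % 5 == 0) for k in range(4) for m in range(1, 11)]

dev, holdout = chronological_holdout(rows, holdout_fraction=0.2)
fold_ids = stratified_fold_ids([r[2] for r in dev], n_folds=3)

# Count positives per fold to check that the rare class is spread evenly.
positives_per_fold = [
    sum(1 for r, f in zip(dev, fold_ids) if f == fold and r[2])
    for fold in range(3)
]
```

The inner folds are for tuning only; the score on the chronological holdout is the one that says how the model does on the future.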

Hope this will help!

etremblay · May 23, 2021, 11:04pm

Hey, thanks for your reply! My main challenge is that I don’t have that much data. But splitting it by time is indeed what makes the most sense.

Thanks,