This is not a fastai-specific question, but the community here is so helpful that I thought I would post it anyway.
I am looking for the best cross-validation strategy to test the performance of a churn prediction model (classification).
The model predicts whether a client is going to churn in the next 6 months. The dependent variable is simply true/false: did the customer churn in the next 6 months? We have usage data for each customer every month, so it is basically a time series dataset per client. Like a lot of churn datasets, the data is highly imbalanced: the churn event is pretty rare.
I tried two cross-validation strategies. The first is StratifiedGroupKFold, grouped by client, so that no client appears in more than one fold and each fold has roughly the same class distribution of churn vs non-churn. This ensures that the model is tested on clients it has never seen in the training data, which prevents leakage between clients. But each test fold mixes observations from the past and the future relative to the training data, which is not ideal for time-based data. In other words, this strategy doesn’t take time into account.
For the second cross-validation strategy, I simply split by time: train on earlier observations and try to predict later ones.
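A minimal sketch of this kind of time split, again on made-up rows. The `time_split` helper, the row layout, and the cutoff date are hypothetical. The optional `gap_months` argument is an assumption of mine rather than something described above: because the label looks 6 months ahead, it lets you drop the last snapshots before the cutoff, whose label windows reach into the test period.

```python
from datetime import date

# Hypothetical rows: (client_id, snapshot_month, churned_within_6_months)
rows = [
    (client, date(2022, month, 1), client % 10 == 0)
    for client in range(50)
    for month in range(1, 13)
]

def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later`."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def time_split(rows, cutoff, gap_months=0):
    """Train on snapshots more than `gap_months` months before `cutoff`,
    test on snapshots at or after `cutoff`."""
    train = [r for r in rows if months_between(r[1], cutoff) > gap_months]
    test = [r for r in rows if r[1] >= cutoff]
    return train, test

cutoff = date(2022, 10, 1)
train, test = time_split(rows, cutoff)
train_gap, _ = time_split(rows, cutoff, gap_months=6)

# Clients that appear on both sides of the cutoff.
shared = {r[0] for r in train} & {r[0] for r in test}

print(len(train), len(test), len(train_gap), len(shared))
```

In this sketch every client appears on both sides of the cutoff, which is the opposite of the grouped split: the clients are familiar, but the time period is new.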
The model gets a good AUC score when using StratifiedGroupKFold, but much worse when splitting by time.
I am not too sure how to interpret that. Does it mean the model is not good at extrapolating into the future, but can predict well on clients it has never seen before?