I am working on a telecom project to classify customers as churners or staying customers. The model is intended to be scored in real time, or at least on an hourly basis.
As you know, the churn rate is usually very small (1%-5%), so the data is imbalanced.
While preparing the data for training I considered the following:
- Let’s say we are training the model based on the customer base snapshot of July 1st
- We monitor the churners in the period between July 1st and Sep 1st (a 2-month churn window)
- We gather all the activities the customer did between June 1st and Sep 1st, plus other profile features (Tenure, Segment, ARPU, etc.)
- Let’s assume the customer base is 1 million records, and the churner count in the 2-month period is only 30,000
- We randomly split the dataset into training and validation sets (70% / 30%)
- When we train a RandomForest on the training set (~700k customers) without SMOTE or down-sampling techniques, we get very good precision and recall on the validation set
- Usually in Telecom we measure model performance with “lift”: how many of the churners we are able to capture in the first decile when the dataset is split into 10 deciles
- The lift is calculated by ordering customers by their predicted churn probability (descending) and splitting the whole dataset into 10 deciles
- In this scenario we get a very high lift, around 6-8x, and we correctly identify ~70% of the churners in the confusion matrix
- When we applied SMOTE, the confusion matrix measures (recall, precision) dropped sharply
Please check this link for more info about the lift
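To make the lift calculation above concrete, here is a minimal sketch with synthetic data as a stand-in (sort by predicted probability descending, cut into 10 deciles, and compare each decile's churn rate to the overall base rate):

```python
import numpy as np

def decile_lift(y_true, y_score, n_bins=10):
    """Lift per decile, from the highest-scored decile to the lowest."""
    order = np.argsort(-np.asarray(y_score))   # descending probability
    y_sorted = np.asarray(y_true)[order]
    bins = np.array_split(y_sorted, n_bins)    # 10 near-equal deciles
    base_rate = np.mean(y_true)                # overall churn rate
    return [b.mean() / base_rate for b in bins]

# Toy example: 1,000 customers, ~3% churners, and a score that is
# informative but noisy (churners tend to score higher).
rng = np.random.default_rng(1)
y = (rng.random(1000) < 0.03).astype(int)
score = y * 0.5 + rng.random(1000) * 0.6
lifts = decile_lift(y, score)
print(lifts[0])  # first-decile lift; with a 3% base rate, a perfect model tops out near 10x
```

Note that with a 3% churn rate the maximum possible first-decile lift is about 10x (all churners fit inside the top decile), which is why the 6-8x figure reads as very strong.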
Now my question is:
- When we use a test dataset from the customer snapshot of Sep 1st, with the activities each customer did between Sep 1st and Oct 1st, we get a much lower lift (~2.5-3x) and very low recall and precision.
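In case it helps to compare our windows, this is roughly how I think of building a snapshot dataset (hypothetical table and column names; assuming activity features come from a window before the snapshot date and the churn label from a window after it, the same way for training and test):

```python
import pandas as pd

def build_dataset(activities, churn_events, snapshot,
                  feature_months=2, outcome_months=2):
    """activities / churn_events: DataFrames with 'customer_id' and 'date'.

    Features come from [snapshot - feature_months, snapshot); the churn
    label from [snapshot, snapshot + outcome_months). Keeping these two
    windows identical in shape for training and test snapshots makes the
    lift numbers comparable across periods.
    """
    snap = pd.Timestamp(snapshot)
    feat_start = snap - pd.DateOffset(months=feature_months)
    out_end = snap + pd.DateOffset(months=outcome_months)

    # Feature sketch: activity counts in the window before the snapshot.
    in_window = activities[(activities["date"] >= feat_start)
                           & (activities["date"] < snap)]
    feats = in_window.groupby("customer_id").size().rename("n_activities")

    # Label: churned at any point in the observation window after the snapshot.
    churned = churn_events[(churn_events["date"] >= snap)
                           & (churn_events["date"] < out_end)]
    label = feats.index.isin(churned["customer_id"]).astype(int)
    return pd.DataFrame({"n_activities": feats, "churned": label})
```

(This is only a sketch; the real pipeline would also keep customers with no activity in the window and join the profile features.)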
I know this is a sign of over-fitting, but I suspect we did something wrong in the data preparation for either the training or the test dataset, or maybe the performance metric we are using is not the right one.
I am trying some of the techniques discussed in lesson 2 to avoid the over-fitting; I will update the post if I get different results.
Please let me know your thoughts.