As companies work to become more "data-driven", they face clear trade-offs when applying machine-learning models to business challenges. One such trade-off is between the accuracy of a model and whether the relevant business unit can interpret its insights and act on them.
In this project, I find that Data Scientists cannot focus only on building the most accurate models: we need to balance accuracy with interpretability and actionability so that our insights are useful. Data Scientists can provide more value to their teams and companies when they understand the business's strategy.
- Exploratory Data Analysis
- Machine Learning Models
- Evaluating the Algorithms
- Model Selection and Recommendations
Telco is a landline telephone subscription company which has been losing customers. Senior management at Telco is worried about the customer churn rate, or the rate at which customers stop paying for their service. Customer churn rate is a key performance indicator (KPI) that every subscription-based company needs to minimize: a low churn rate helps a subscription-based company maintain its revenue flow and avoid the costly process of acquiring new customers.
For this project, I use the Telco Dataset to address the problem of churn rate. Acting as a Data and Strategy Analyst at Telco, I build machine-learning models using Logistic Regression, Random Forest and Decision Tree methods to understand why customers churned (Churn = Yes) and to predict which customers are most likely to churn next. To evaluate and improve my predictions, I also use confusion matrices, error rate plots, feature importance plots, and boosting.
Based on my findings, I write recommendations to Telco's senior management team on how they can retain the customers who are most likely to leave. Given that few members of my audience understand machine-learning models, I evaluate the benefits and drawbacks of each model and pick those that maximize the usefulness of the insights gained, so that Telco's strategy team can implement my findings.
- Compare machine-learning methods to explain why customers churned
- Decide which model is the best for communicating to senior management
- Use this model to identify which customers are most likely to churn next
- Provide implementable recommendations for Telco's senior management to decrease churn rate
Exploratory Data Analysis
First, I read in the data and load the R packages.
There are 7043 customers represented in this dataset, and this dataset has 21 variables. Each customer is assigned a unique customerID. Tenure indicates how many months the customer has stayed with the company. MultipleLines means that the customer has multiple telephone lines connected to their account.
There are 11 NAs in the TotalCharges variable. I choose to remove the rows that contain NAs. I also clean the MultipleLines and SeniorCitizen columns. I remove the customerID column because it is a factor with over 7000 levels: it is unique to each customer, so it carries no information about churn and would distort the models I build.
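The original analysis was done in R; as a language-neutral illustration, the two cleaning steps described above (dropping rows with a missing TotalCharges and dropping the identifier column) might look like the following Python sketch. The rows and their values are hypothetical, and the cleaning of MultipleLines and SeniorCitizen is not reproduced because the text does not say how it was done.

```python
# Hypothetical rows shaped like the Telco data; the values are made up.
rows = [
    {"customerID": "A-001", "tenure": 12, "TotalCharges": 350.5, "Churn": "No"},
    {"customerID": "A-002", "tenure": 0, "TotalCharges": None, "Churn": "No"},
    {"customerID": "A-003", "tenure": 2, "TotalCharges": 140.0, "Churn": "Yes"},
]


def clean(rows):
    """Drop rows with a missing TotalCharges, then drop the customerID key."""
    cleaned = []
    for row in rows:
        if row["TotalCharges"] is None:
            continue  # mirrors removing the rows that contain NAs
        # Keep every column except the per-customer identifier.
        cleaned.append({k: v for k, v in row.items() if k != "customerID"})
    return cleaned


cleaned_rows = clean(rows)
print(len(rows), "rows before cleaning,", len(cleaned_rows), "rows after")
```

If the customerIDs are needed later (for example, to list at-risk customers), they would have to be kept in a separate copy of the data before this step.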
WHICH CUSTOMERS CHURNED?
Methods: DescTools and ggplot2 packages
CHURNED CUSTOMERS: DEMOGRAPHICS
CHURNED CUSTOMERS: TELCO USAGE
CHURNED CUSTOMERS: PAYMENTS
Machine Learning Models
HOW ACCURATELY CAN WE PREDICT CHURN?
Methods: Logistic Regression, Decision Tree, Random Forest
To provide recommendations for Telco's senior management, I conduct, improve and compare machine-learning models to identify how they can decrease churn. First, I create my training and testing datasets.
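The text does not state how the data was split, so the following is only a sketch of one common approach: a seeded random shuffle followed by a 70/30 cut. The 70% fraction, the seed, and the function name `train_test_split` are assumptions made for illustration (the original work was done in R). The row count of 7032 is the 7043 customers minus the 11 rows with NAs.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows reproducibly and cut it into two parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the cleaned dataset: one integer per customer row.
customers = list(range(7032))
train, test = train_test_split(customers)
print(len(train), "training rows,", len(test), "testing rows")
```

Fixing the seed matters when comparing models: every model is then trained and tested on exactly the same rows, so differences in accuracy come from the models rather than from the split.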
LOGISTIC REGRESSION MODEL 1
I included all the variables in my logistic regression.
From this summary, we can see that many variables are not important for predicting whether or not a customer will churn, as shown by their lack of asterisks and high p-values. Contract type, TotalCharges and tenure are the most statistically significant variables.
HOW ACCURATE IS LOGISTIC REGRESSION MODEL 1?
Using an Analysis of Variance (ANOVA) helps me identify which features are the most important, so that I can simplify my logistic regression.
As we add each variable, we can see the drop in residual deviance. Adding InternetService, tenure and MultipleLines significantly reduces the residual deviance. StreamingTV and StreamingMovies have low p-values, but they provide only a small reduction in residual deviance.
To make the model easier to interpret, I limit it to the most relevant variables I identified in the first summary, that is, those significant at the 0.001 level (***).
LOGISTIC REGRESSION MODEL 2
Improvement Method: 2nd iteration
HOW ACCURATE IS LOGISTIC REGRESSION MODEL 2?
Surprisingly, Logistic Regression Model 2 better predicts churn on my testing dataset, with 81.2% accuracy. This suggests that my first logistic regression overfit my training data.
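Test-set accuracy of this kind is read off a confusion matrix. The sketch below shows the arithmetic in Python (the original work was done in R). The four counts are hypothetical and are not the author's results; they are only there to make the formulas concrete.

```python
# Hypothetical confusion-matrix counts for a churn classifier on a test set.
true_positives = 300    # predicted churn, customer did churn
true_negatives = 1400   # predicted no churn, customer stayed
false_positives = 150   # predicted churn, customer stayed
false_negatives = 260   # predicted no churn, customer did churn

total = true_positives + true_negatives + false_positives + false_negatives

# Accuracy: share of all predictions that were correct.
accuracy = (true_positives + true_negatives) / total

# Sensitivity (recall): share of actual churners the model caught.
sensitivity = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.3f}, sensitivity = {sensitivity:.3f}")
```

For a retention program, sensitivity can matter as much as accuracy: a model can score well overall by predicting "no churn" for most customers while still missing many of the customers who actually leave.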
Improvement Method: Boosting
We can use a tree model to divide Telco customers based on factors that increase their likelihood to churn. Each branch indicates a decision boundary that divides up the customers.
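To make "decision boundary" concrete, here is a small Python sketch of how a tree might score one candidate split. It assumes Gini impurity as the splitting criterion, which is a common default but is not confirmed by the text, and the customer counts are hypothetical.

```python
def gini(churned, stayed):
    """Gini impurity of a node with two classes: 2 * p * (1 - p)."""
    total = churned + stayed
    if total == 0:
        return 0.0
    p = churned / total
    return 2 * p * (1 - p)


def weighted_gini(left, right):
    """Average impurity of two child nodes, weighted by their sizes."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(*left) + (n_right / n) * gini(*right)


# Hypothetical (churned, stayed) counts before and after a split on Contract.
parent = (40, 60)
month_to_month = (35, 15)
longer_contract = (5, 45)

gain = gini(*parent) - weighted_gini(month_to_month, longer_contract)
print(f"impurity reduction from this split: {gain:.2f}")
```

A tree-growing algorithm of this kind evaluates many candidate splits and keeps the one with the largest impurity reduction, which is why the variable at the top of the tree is the one that separates churners from non-churners most cleanly.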
The Boosting method ranks each variable by its importance for predicting customer churn. Applying boosting to the tree model lets me rank the variables that should be the highest priority for Telco's senior management, so that Telco can be strategic in how it allocates time and resources to retaining customers.
This summary shows that Contract type has by far the greatest relative influence on the customer's decision to churn, according to the training data. It is followed by tenure and OnlineSecurity, the latter of which does not appear in the Decision Tree.
Improvement method: Error Plot + Tuning
Now I use the caret package to build my random forest model.
HOW ACCURATE IS THE RANDOM FOREST MODEL?
I use this error plot to determine whether increasing the number of trees in the random forest model leads to a significant decrease in error rate. Above about 200 trees, there is no significant difference in any of the error rates, so I can limit my model to 200 trees.
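The same judgment can be made programmatically rather than by eye. The Python sketch below picks the smallest forest whose error is within a small tolerance of the best error observed. The error values, the tolerance, and the function name are all hypothetical; they are not taken from the author's plot.

```python
def smallest_sufficient_trees(errors_by_trees, tolerance=0.002):
    """Return the smallest tree count whose error is within `tolerance`
    of the lowest error seen across all tree counts."""
    best = min(errors_by_trees.values())
    for n_trees in sorted(errors_by_trees):
        if errors_by_trees[n_trees] - best <= tolerance:
            return n_trees


# Hypothetical error rates measured at several forest sizes.
errors = {50: 0.230, 100: 0.215, 200: 0.204, 300: 0.203, 400: 0.204, 500: 0.203}

chosen = smallest_sufficient_trees(errors)
print("number of trees to keep:", chosen)
```

Choosing the smallest sufficient forest keeps training and prediction cheaper without giving up measurable accuracy.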
FIT THE NEW TREES AFTER TUNING
WHICH FEATURES ARE THE MOST IMPORTANT ACCORDING TO THE RANDOM FOREST MODEL?
Evaluating the Algorithms: which model is best?
For the purposes of presenting my findings to Telco's senior management, I am judging each model on not only its accuracy but also its interpretability and the ease with which its findings can be implemented.
LOGISTIC REGRESSION MODEL
- Accuracy for Test set prediction: 81.2%
- Pros: The logistic regression was my most accurate model. It also tells the management team whether they should try to increase or decrease each variable affecting churn rate. For example, ContractOne (one year contract) and ContractTwo (two year contract) had negative estimates in the model. This means that customers on these contracts are less likely to churn, so moving more customers onto them should decrease churn.
- Cons: There are too many significant variables (with *** asterisks and low p-values) in this model, so it does not tell senior management which variable should be prioritized first.
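The sign of a logistic regression estimate can be turned into a plainer statement by exponentiating it into an odds ratio. The Python sketch below illustrates this; the coefficient values are hypothetical (the author's fitted estimates are not shown in the text), and `PaperlessBillingYes` is an assumed variable name.

```python
import math

# Hypothetical estimates on the log-odds scale; not the author's fitted values.
coefficients = {
    "ContractOne": -0.70,
    "ContractTwo": -1.40,
    "PaperlessBillingYes": 0.35,
}


def describe(name, beta):
    """Translate a log-odds estimate into an odds ratio and a direction."""
    odds_ratio = math.exp(beta)  # ratio below 1 lowers the odds, above 1 raises them
    direction = "lowers" if beta < 0 else "raises"
    return f"{name}: odds ratio {odds_ratio:.2f} ({direction} the odds of churn)"


for name, beta in coefficients.items():
    print(describe(name, beta))
```

An odds ratio is easier to present to a non-technical audience than a raw estimate: a ratio of 0.5, for instance, says the odds of churn are halved relative to the baseline category, holding the other variables fixed.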
DECISION TREE MODEL
- Accuracy for Test set prediction: 79.3%
- Pros: The single decision tree was the simplest and most easily interpretable model. It creates a visual hierarchy for each of the variables affecting churn, and prioritizes Contract type first, followed by InternetService, tenure and MonthlyCharges. It could provide an effective set of strategies for senior management.
- Cons: This is the least accurate model, and may be overly simplistic.
RANDOM FOREST MODEL
- Accuracy for Test set prediction: 80.82%
- Pros: This model was very accurate, and tuning improved the error rate and sensitivity of the model. The VarImportance function ranked the variables in order of importance, which helps senior management prioritize.
- Cons: This model is a "black box", so is difficult to interpret. Using the VarImportance plot does not give insights into how these variables impact churn.
Model Selection and Recommendations
Based on the pros and cons of each method, I have decided to proceed with the Decision Tree as my model to help senior management prioritize which variable to address to improve customer retention. I will then use Logistic Regression Model 2's findings to indicate how senior management should address each variable in order to reduce churn.
The Decision Tree is the simplest model to explain because it shows a clear visual hierarchy as to what Telco should focus on first: change the customer Contract type to One or Two years, instead of Month-to-Month where possible.
Combining the Decision Tree with the positive and negative estimates of Logistic Regression Model 2, Telco can see each variable as a lever that will increase or decrease churn. For example, the Payment Methods which reduce churn are Credit Cards and Mailed Checks, whereas customers paying through Electronic Check are significantly more likely to churn.
INSIGHTS FOR SENIOR MANAGEMENT: WHAT MAKES CUSTOMERS MORE LIKELY TO CHURN?
These findings are based on the Logistic Regressions:
- Contract: Customers using a Month-to-month contract are significantly more likely to churn than customers using a One year or Two year contract.
- Tenure: Customers with shorter Tenure were more likely to churn.
- Payment Method: Customers using an Electronic Check were more likely to churn than those who paid automatically using a Credit Card or by a Mailed Check.
- Total Charges: Customers who had higher Total Charges were more likely to churn.
- Paperless Billing: Customers who used Paperless Billing were more likely to churn.
- Multiple Lines: Customers who had Multiple Lines were more likely to churn.
- Tech Support: Customers with no Tech Support were more likely to churn.
- Phone Service: Customers with no Phone Service were more likely to churn.
CHURN CUSTOMER PROFILE: CUSTOMERS MOST LIKELY TO CHURN NEXT
Based on the findings in the Decision Tree, we can identify that these customers are most likely to churn.
- Contract: Month-to-Month
- Internet Service: Yes
- Tenure: < 15.5 months
IDENTIFY WHICH CUSTOMERS WILL CHURN NEXT
Method: Subset the dataset based on the Churn Profile and Boosting results
Using the customer profile, I create a subset of customers who are vulnerable to churn, called, "vulnerable customers".
There are 333 vulnerable customers. Telco can keep track of these customerIDs to measure which of these customers churn once a solution is implemented, and thus gauge how effective that solution is. To enable them to do this, I list the first 20 customerIDs of these customers.
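As an illustration of the subsetting step, the Python sketch below filters customers on the three churn-profile conditions from the Decision Tree (Month-to-month contract, an internet service, tenure under 15.5 months) and lists up to 20 IDs. The customer records are hypothetical, the original work was done in R, and any additional filtering based on the Boosting results is not reproduced because the text does not specify it.

```python
# Hypothetical customer records; IDs and values are made up.
customers = [
    {"customerID": "C-001", "Contract": "Month-to-month", "InternetService": "Fiber optic", "tenure": 3},
    {"customerID": "C-002", "Contract": "Two year", "InternetService": "DSL", "tenure": 40},
    {"customerID": "C-003", "Contract": "Month-to-month", "InternetService": "No", "tenure": 5},
    {"customerID": "C-004", "Contract": "Month-to-month", "InternetService": "DSL", "tenure": 20},
    {"customerID": "C-005", "Contract": "Month-to-month", "InternetService": "DSL", "tenure": 15},
]


def is_vulnerable(customer):
    """True when a customer matches all three churn-profile conditions."""
    return (
        customer["Contract"] == "Month-to-month"
        and customer["InternetService"] != "No"
        and customer["tenure"] < 15.5
    )


vulnerable_ids = [c["customerID"] for c in customers if is_vulnerable(c)]
first_20 = vulnerable_ids[:20]  # at most the first 20 IDs, for tracking
print(len(vulnerable_ids), "vulnerable customers:", first_20)
```

Keeping the profile in a single function makes it easy to rerun the same filter after a retention program launches and compare how many of the flagged customers actually churned.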
HOW COULD TELCO PROACTIVELY PREVENT THESE CUSTOMERS FROM LEAVING? WHAT MIGHT A CUSTOMER RETENTION PROGRAM LOOK LIKE?
In order of priority for retaining customers, Telco could implement the following solutions:
- Contract: Contract type is by far the most important variable for customer retention, and should be Telco's priority. Telco could advertise the ease of One Year or Two Year contracts for Month-to-Month customers to encourage them to switch, and could cease to offer the Month-to-Month contracts to new customers. They could also provide a small cash-back incentive for customers who switch.
- Tenure: This is highly dependent on the Contract type of each customer.
- Internet Service: Telco could create a bundled deal for customers who want Fiber Optic internet service as well as phone service.
- Payment Method: Telco could advertise the ease of switching to automatic Credit Card payments for customers who pay by Electronic Check. Telco could cease to offer Electronic Checks as an option for payment to new customers who register.
- Total Charges: As customers are financially sensitive to the total amount they are charged, Telco can sweeten the deal by offering discounts on other variables, such as the Payment Method or Contract type.
- Multiple Lines: Telco could create a cheaper package for customers who want multiple telephone lines in their home.
From this challenge, I learnt that it is important to include Data Scientists in the strategy and business model planning of a company, if their work is to be valuable for solving business problems.
Exploring a variety of machine learning models showed me how the simplicity of a single Decision Tree might be much more useful to a senior management team who wants to understand churn rate than a much more complicated "black box" model like Random Forest. In this case, Telco can probably afford to sacrifice a couple of percentage points of accuracy in predicting outcomes in the test data. But the question of which model to choose is ultimately a judgment call for the Data Scientist. If I were building a classification model to be applied in the health sphere, for example, a 0.5% difference in accuracy might make a big difference to the outcome, and I might have picked a "black box" model.
In the Telco case, the real challenge was to generate a model whose findings were insightful into the way that customers behaved, and from that create tangible, resource-efficient steps the company can take to reduce churn.
I experimented with some other models we learnt in class to confirm my findings that the Logistic Regression was appropriate.
The Logistic Regression (gbm) also performs marginally better than other models such as K-Nearest Neighbors (knn), and about the same as Linear Discriminant Analysis (lda).