Predicting Customer Churn. Yilun Gu, Anna Klutho, Yinglu Liu, Yuhuai Wang, Hao Yan


EXECUTIVE SUMMARY

PROBLEM

How can QWE predict customer churn and increase customer retention in the coming months?

INSIGHTS

In our analysis of the QWE customer data for November and December 2011, the three drivers that influence QWE customer churn the most are:

- Change in Customer Happiness Index (between November and December)
- Customer Age (expressed in months as a QWE customer)
- Recency of Logins (expressed through days since last login)

RECOMMENDATIONS

Given this information, QWE should implement data-driven strategies to address each of these factors and reduce churn in the coming months. Recommended strategies include:

- Customer satisfaction programs
- Incentives to increase login recency and frequency

ANALYSIS

DEFINING CHURN

Before investigating which drivers influence customer churn, it is important to first establish the overall churn rate within the QWE customer base. This gives us a basic understanding of the retention problem QWE is facing. Of the 6437 customers in the database, only 323 left QWE ("churned") between November and December. This gives a starting churn rate of 5.1%: across QWE's customer base, 5.1% of customers left the company in the last month.

IS CHURN CAUSED BY ONE DRIVER ALONE?

In this section, we evaluate customer churn through the lens of a single customer characteristic: can churn be explained by one driver alone?

CUSTOMER AGE

It is natural to assume that the length of a customer relationship ("Customer Age") would have a large impact on customer retention. But can Customer Age predict churn on its own? Graph 1 shows the relationship between Customer Age and Customer Churn (where 1 = Customer Churn and 0 = No Customer Churn). In this graph, we see no apparent relationship between Customer Age and Customer Churn. This suggests that Customer Age, taken alone, does not have a large impact on the probability that a customer leaves QWE.
Graph 1. Customer Age vs. Customer Churn

However, while age is not strongly correlated with churn on its own, it does help predict churn when considered together with other variables, as seen later in this paper.
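One way to check a single driver numerically, alongside a plot such as Graph 1, is to compare churn rates across customer-age bands. The sketch below is a minimal illustration on made-up records; the function name `churn_rate_by_band`, the 12-month band width, and the sample data are assumptions and not part of the QWE analysis.

```python
from collections import defaultdict


def churn_rate_by_band(records, band_width=12):
    """Group (age_months, churned) pairs into age bands and return
    {band_start_month: churn_rate} for each band that has customers."""
    totals = defaultdict(int)
    churned_counts = defaultdict(int)
    for age_months, churned in records:
        band = (age_months // band_width) * band_width
        totals[band] += 1
        churned_counts[band] += int(churned)
    return {band: churned_counts[band] / totals[band] for band in sorted(totals)}


# Made-up records for illustration: (customer age in months, 1 = churned).
records = [(3, 0), (8, 1), (14, 0), (20, 0), (30, 1), (33, 0), (35, 0), (11, 0)]

for band, rate in churn_rate_by_band(records).items():
    print(f"{band}-{band + 11} months: churn rate {rate:.1%}")
```

If the rates are roughly flat across bands, the driver shows little relationship with churn on its own, which is the pattern the text describes for Customer Age.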

LOOKING AT OTHER DRIVERS - CUSTOMER HAPPINESS INDEX (CHI)

While Customer Age may not be the best predictor of churn on its own, this does not rule out the possibility that another driver can explain churn by itself. The question asked here is: "What driver explains customer churn the best on its own?"

Using correlation and univariate logistic regression across all 11 provided customer characteristics, our team examined how well the current Customer Happiness Index score predicts customer churn. We chose this variable because it had the strongest association with customer churn (with a correlation value of ) and the highest significance as an individual predictor of churn (p = 2.04e-11). Using this information, our team built a model to predict the probability of customer churn for 3 randomly selected customers (Customers 672, 354, & 5203):

Table 1. Probability of Churn for Customers 672, 354, & 5203

Customer    Customer CHI Score    Probability of Churn*    Actually Churn?
                                  %                        No
                                  %                        No
                                  %                        No

* $P(\text{Churn}) = \dfrac{1}{1 + e^{-[\beta_0 + \beta_1 \cdot \text{CHI}]}}$, where $\beta_0$ is the estimated intercept and $\beta_1$ the estimated CHI coefficient

This information supports Wall's theory that happiness would be a major driver of customer churn. As happiness goes up, the probability of a customer leaving decreases.

WHAT VARIABLES CAN BE USED TO PREDICT CHURN?

In this section, we look at methods that incorporate multiple drivers to predict churn. While the current Customer Happiness Index succeeded in predicting customer churn individually, it is unlikely that an outcome such as churn is determined by a single variable alone. Other methods can therefore be used to see which combinations of drivers best predict churn and which of these variables matter most. The following sections provide two possible approaches to answer this question.
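The single-variable formula given under Table 1 is the standard logistic function applied to the CHI score. The sketch below shows how such a probability could be computed; the intercept and slope are hypothetical placeholders chosen only for illustration (the fitted QWE values are not reproduced here), and `churn_probability` is an assumed helper name.

```python
import math


def churn_probability(chi, intercept, slope):
    """Logistic model: P(churn) = 1 / (1 + e^-(intercept + slope * chi))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * chi)))


# Hypothetical coefficients for illustration only; not the fitted QWE values.
INTERCEPT = -2.0
SLOPE = -0.01  # negative slope: a higher happiness score lowers the churn probability

for chi in (0, 50, 100, 200):
    print(chi, round(churn_probability(chi, INTERCEPT, SLOPE), 4))
```

With a negative slope, the printed probabilities fall as the CHI score rises, matching the direction described in the text.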
MULTIPLE LOGISTIC REGRESSION - PREDICTING INDIVIDUAL PROBABILITIES OF CHURN

Multiple Logistic Regression (MLR) is a statistical technique that allows us to incorporate multiple customer characteristics to determine the probability of customer churn. The results of this analysis allow us to calculate a churn probability for each individual customer, which can then be used to rank customers from most to least likely to churn. For this approach, our team chose the following customer characteristics to include in the model:

- Change in Customer Happiness Index Score (between November and December)
- Customer Age (expressed in months as a QWE customer)
- Recency of Logins (expressed through days since last login)
- Current Customer Happiness Index Score
- Change in Number of Blog Views

These variables were selected because they had the highest significance in a model that included all 11 possible customer characteristics (see Exhibit A in the Appendix for more details). This means they had the highest impact on churn within the model. By re-running the model with these 5 characteristics, we can predict the probability of customer churn for the same three randomly selected customers (Customers 672, 354, & 5203), as reported in Table 2.
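Once a multiple logistic regression has been fitted on the five characteristics listed above, scoring and ranking customers is a small calculation. The following is a minimal sketch under stated assumptions: the coefficient values, the feature names, and the three sample customers are all invented for illustration and are not the Exhibit B model or QWE data.

```python
import math

# Hypothetical coefficients (illustration only, not the Exhibit B model).
COEFFS = {
    "intercept": -3.0,
    "chi_change": -0.01,
    "age_months": 0.01,
    "days_since_login": 0.02,
    "chi_now": -0.005,
    "blog_views_change": -0.0001,
}


def churn_probability(features):
    """Combine the features linearly, then map the score to a probability."""
    z = COEFFS["intercept"] + sum(
        COEFFS[name] * features[name] for name in COEFFS if name != "intercept"
    )
    return 1.0 / (1.0 + math.exp(-z))


def rank_by_risk(customers):
    """Return (customer_id, probability) pairs, riskiest customer first."""
    scored = [(cid, churn_probability(feats)) for cid, feats in customers.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up customers for illustration.
customers = {
    "A": {"chi_change": -50, "age_months": 12, "days_since_login": 40,
          "chi_now": 20, "blog_views_change": 0},
    "B": {"chi_change": 10, "age_months": 24, "days_since_login": 1,
          "chi_now": 150, "blog_views_change": 5},
    "C": {"chi_change": 0, "age_months": 3, "days_since_login": 10,
          "chi_now": 80, "blog_views_change": 0},
}

for cid, prob in rank_by_risk(customers):
    print(cid, f"{prob:.1%}")
```

Because every customer receives an individual probability, the ranked list can be cut at any length, which is the property the text highlights for this method.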

Table 2. Probability of Churn for Customers 672, 354, & 5203 (MLR)

Customer    Probability of Churn*
            3.4%
            3.3%
            5.3%

* Please see Exhibit B in the Appendix for the MLR model.

As mentioned earlier, the advantage of this approach is that we get a list of individual customers and their individual probabilities, allowing QWE management to specifically target the needs of these customers.

DECISION TREES - SEGMENTING CUSTOMERS BY CUSTOMER CHURN

However, these results change when calculated with a different method. The Decision Tree method is a predictive model that segments customers based on a set of decision rules. Because its output is simple and graphical, it is easy to interpret and to use as a guide for decisions. In the case of the QWE customers, after all 11 variables are entered into the decision tree model, the statistical software package chooses the most important factors that contribute to churn and calculates the probability of churn accordingly.

Graph 2. Decision Tree Output for QWE

Interpreting this graph, we see that four variables have an impact on churn rates:

- Recency of Logins (expressed through days since last login)
- Frequency of Logins (expressed through number of logins between November and December)
- Customer Age (expressed in months as a QWE customer)
- Number of Blog Views (between November and December)

From these factors, a set of rules is established to frame the likelihood of customer churn. Following the pathways of the tree, a customer who meets the criterion at a node goes to the left; otherwise the customer goes to the right. The final node provides that customer's likelihood of churn. Following this model, we can determine the probability of churn for each of the selected customers (Customers 672, 354, and 5203). A summary for these customers can be found in Table 3.
Table 3. Probability of Churn for Customers 672, 354, & 5203 (Decision Tree)

Customer    Probability of Churn*
            3.9%
            3.9%
            3.9%

Given that Days (days since last login) < 17.5 for all three customers, we can follow the decision tree model to conclude that all three have a 3.9% chance of churning.

COMPARING METHODS

Comparing the results of the Decision Tree method with those of the Multiple Logistic Regression, we see a difference in the final churn probabilities predicted for each customer (see Table 4).
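The decision tree walk described above (go left when a node's rule is met, then read the probability at the final node) can be sketched as a small traversal. Only the rule "Days < 17.5 gives 3.9%" comes from the text, and treating it as a leaf directly under the root is an assumption; the right-hand branch, its threshold, and its probabilities are invented for illustration.

```python
# A node is either a leaf (a float churn probability) or a split:
# (feature, threshold, left_subtree, right_subtree).
# A customer goes left when customer[feature] < threshold.
TREE = (
    "days_since_login", 17.5,
    0.039,                       # from the text: Days < 17.5 -> 3.9%
    ("logins", 5.0, 0.15, 0.08)  # hypothetical split and leaf values
)


def predict(node, customer):
    """Follow the rules from the root until a leaf probability is reached."""
    while not isinstance(node, float):
        feature, threshold, left, right = node
        node = left if customer[feature] < threshold else right
    return node


# Made-up customers for illustration.
recent_user = {"days_since_login": 3, "logins": 20}
lapsed_user = {"days_since_login": 30, "logins": 2}

print(predict(TREE, recent_user))
print(predict(TREE, lapsed_user))
```

Every customer who lands in the same leaf receives the same probability, which is why the tree assigns one value to a whole segment instead of one per customer.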

Table 4. Comparing Results for Customers 672, 354, & 5203

Customer    Decision Tree    Multiple Logistic Regression    Customer Actually Churn?
            %                3.4%                            No
            %                3.3%                            No
            %                5.3%                            No

This difference occurs for two reasons:

- Different variables are used to determine the chance of churn
  o While the decision tree method uses four variables selected by the software to determine probability, MLR uses the five variables selected by the team to calculate the chance of churn.
- The methods differ in how they calculate probability
  o The decision tree model predicts according to the rules it has established, starting with the single factor at the top of the tree, and provides a single probability for each group of customers.
  o MLR calculates probability individually from the characteristics of each customer; since the customers vary across the selected variables, it makes sense that their probabilities differ as well.

WHICH METHOD TO USE?

While both methods have their advantages and disadvantages, our team recommends using the Multiple Logistic Regression method to determine which customers are most likely to churn. Why? The Decision Tree is less precise in predicting individual probabilities for each customer. With the tree model, we can identify the customer segment most likely to leave, but cannot narrow the range any further. With MLR, we get a list of individual customers and their individual probabilities, allowing QWE management to specifically target the needs of these customers.

ACCURACY

It is important to evaluate the accuracy of our recommended method as well. Accuracy is the percentage of predictions that match what actually happened, so maximizing it helps the model capture the current situation correctly. In addition, accuracy is easily understood and communicated across a business. Our team chose a threshold of 12% (i.e. we predict a customer will churn if P(C) ≥ 12%), as it provides the highest overall accuracy for this model. At this level, the MLR model has a 93.4% accuracy rate.

WHO IS MOST LIKELY TO CHURN?

Given the results from our analysis, the following customers have the highest probability of churn:

Table 5. Top 10 Customers Most Likely to Churn in the Coming Months

Customer Number    Probability of Churn    Actually Churn?
                                           No
                                           Yes
                                           No
                                           No
                                           No
                                           No
                                           No
                                           No
                                           No
                                           Yes

Although these 10 customers have the highest probability of churn within the customer data set, only two of them actually churned in the last month (November to December). While this implies a lack of accuracy for the model, this weakness is offset by the benefit the model provides: the ability to rank each customer individually by probability of churn. Our team argues that the model instead captures the potential for churn in the coming months, and that the remaining 8 individuals should be watched and managed carefully in the next month to ensure they do not leave QWE.
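The threshold rule in the ACCURACY section (predict churn when the model's probability reaches the cutoff, then count how often the prediction matches what happened) can be sketched as follows. The probabilities, outcomes, and candidate thresholds are made up for illustration, and the helper names `accuracy` and `best_threshold` are assumptions; treating the cutoff as inclusive is also an assumption.

```python
def accuracy(probs, churned, threshold):
    """Share of customers whose predicted label (prob >= threshold)
    matches the observed outcome (1 = churned, 0 = stayed)."""
    predictions = [p >= threshold for p in probs]
    hits = sum(pred == bool(actual) for pred, actual in zip(predictions, churned))
    return hits / len(probs)


def best_threshold(probs, churned, candidates):
    """Return the candidate threshold with the highest accuracy."""
    return max(candidates, key=lambda t: accuracy(probs, churned, t))


# Made-up model outputs and outcomes for illustration.
probs = [0.02, 0.05, 0.11, 0.13, 0.20, 0.30]
churned = [0, 0, 0, 0, 1, 1]

for t in (0.05, 0.12, 0.15, 0.50):
    print(f"threshold {t:.2f}: accuracy {accuracy(probs, churned, t):.1%}")
print("best:", best_threshold(probs, churned, [0.05, 0.12, 0.15, 0.50]))
```

Scanning a grid of candidate thresholds in this way is one simple approach to picking the cutoff with the highest overall accuracy on a given data set.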

RECOMMENDATION

In the end, we see that the following three drivers* have the highest impact on predicting customer churn:

- Change in Customer Happiness Index (between November and December)
- Customer Age (expressed in months as a customer)
- Recency of Logins (expressed through days since last login)

* While the MLR model included 5 drivers in its calculations, these three characteristics had the most significant coefficients in the model.

Intuitively, this relationship makes sense: a change in happiness level, the length of a customer relationship, and how active the customer is (as expressed through recency of logins) could each logically have a significant impact on customer churn. Our model supports this. Therefore, we recommend that QWE management build strategies to address these three drivers in their operations. Such strategies include:

- Customer Satisfaction Programs: when a customer's Customer Happiness Index score drops dramatically, personalized outreach that offers concrete solutions to the customer's problems would be beneficial.
- Incentives to Increase Login Recency and Frequency: one possible incentive is a reduction of the QWE subscription price based on the number and frequency of logins in a month.

By implementing these strategies, QWE may be able to reduce churn in the future.

APPENDIX

EXHIBIT A - RESULTS FROM MULTIPLE LOGISTIC REGRESSION WITH ALL 11 VARIABLES

Variable                 Estimate    Std. Error    p-value    Significance Level
(Intercept)              -2.76E      e             <          ***
CHI                      e           e                        ***
Age                      1.271e      e                        *
Change in CHI            e           e                        ***
Cases                    e           e
Change in Cases          1.703e      e
SP                       1.593e      e
Change in SP             e           e
Logins                   2.893e      e
Blogs                    2.905e      e
Views                    e           e                        **
Days since Last Login    1.724e      e                        ***

EXHIBIT B - MLR PREDICTION MODEL