Mining Churning Behaviors and Developing Retention Strategies Based on A Partial Least Square (PLS) Model


Accepted Manuscript

Hyeseon Lee, Yeonhee Lee, Hyunbo Cho, Kwanyoung Im, Yong Seog Kim

PII: S (11)
DOI: doi: /j.dss
Reference: DECSUP
To appear in: Decision Support Systems
Received date: 27 July 2010
Revised date: 5 June 2011
Accepted date: 24 July 2011

Please cite this article as: Hyeseon Lee, Yeonhee Lee, Hyunbo Cho, Kwanyoung Im, Yong Seog Kim, Mining Churning Behaviors and Developing Retention Strategies Based on A Partial Least Square (PLS) Model, Decision Support Systems (2011), doi: /j.dss

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Hyeseon Lee, Yeonhee Lee, Hyunbo Cho, Kwanyoung Im
Division of Industrial and Management Engineering
Pohang University of Science & Technology
San 31 Hyoja, Pohang , Republic of Korea

Yong Seog Kim*
Management Information Systems Department
Jon M. Huntsman School of Business
3515 Old Main Hill
Utah State University
Logan, UT , USA

ABSTRACT

In a very competitive mobile telecommunication business environment, marketing managers need a business intelligence model that allows them to maintain an optimal (or at least a near optimal) level of churners effectively and efficiently while minimizing the costs of their marketing programs. As a first step toward an optimal churn management program for marketing managers, this paper focuses on building an accurate and concise predictive model for churn prediction, utilizing a Partial Least Square (PLS)-based methodology on data sets with highly correlated variables. A preliminary experiment demonstrates that the presented model provides more accurate performance than traditional prediction models and identifies key variables to better understand churning behaviors. Further, a set of simple churn marketing programs (device management, overage management, and complaint management strategies) is presented and discussed.

KEYWORDS

Partial Least Square; Customer Relationship Management; Business Intelligence; Churn Management

*) Corresponding author: Tel: Fax: yong.kim@usu.edu
*) Preliminary version of this paper was published and presented in AMCIS

1. INTRODUCTION

For many firms in a very competitive market and business environment, customer churn management becomes one of the most critical success factors, mainly due to higher acquisition costs for new customers. In particular, with exceptionally high annual churn rates (20-40%), firms in the mobile telecommunications industry try to develop predictive models that accurately identify which customers are most likely to terminate the current relationship. Therefore, the optimal selection of customer targets (e.g., most probable churners) and of the marketing campaign size at which the campaign profit is maximized has been considered a critical success factor in customer relationship management (CRM) programs.
However, according to a recent study [36] based on a Dutch survey involving 228 database marketing companies, many marketing managers still select their targets either intuitively or based on long-standing methods such as cross tabulation or RFM (recency, frequency, monetary), a segmentation based database marketing method [32]. For example, in RFM analysis, marketing managers first divide customers into a total of 125 groups (5 groups for each of three past purchasing pattern dimensions) in such a way that customers within the same group are similar in terms of past purchasing patterns: how recently they have made a purchase (recency), how often they have made purchases (frequency), and how much they have spent (monetary). Then, marketers narrow their focus to customer groups who have recently made a purchase, have made multiple purchases, and have spent more money over their customer lifetime.

RFM analysis and its variants have been used successfully for more than 30 years because they are simple and easy to apply. However, they are very limited in the sense that they cannot be applied to new customers in target markets because no RFM information is available for new customers. In addition, they do not provide any insights about new customers and markets, and hence cannot be used to expand customer bases: marketers shift their focus only to customers with high RFM scores, which, in turn, boosts the RFM scores of targeted customers while lowering those of customers with low RFM scores. Finally, RFM analysis may not be applicable to certain products. For example, customers who recently purchased an expensive car are not likely to buy another car very soon. In contrast, customers who purchased a car several years ago are more likely to purchase another car.

Fortunately, the rapid development of new data mining and business intelligence models over the past two decades in computer science, cognitive science, and information systems makes it possible for marketing managers to disseminate micromarketing messages targeted toward a specific group of households that are most likely to be open to customized incentives. When a churn management program is operated successfully, marketing managers can further identify the most likely churners in advance, and then develop and present the right retention offers customized across different customer segments.
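To make the RFM grouping described above concrete, the following minimal sketch assigns each customer a 1-to-5 score on each of the three dimensions, which yields 5 x 5 x 5 = 125 possible groups. The field names (`days_since_purchase`, `n_purchases`, `total_spent`) and the rank-based quintile rule are assumptions made for illustration; the paper does not specify how the five groups per dimension are formed.

```python
def quintile_scores(values, higher_is_better=True):
    """Assign each value a score from 1 (worst) to 5 (best) by rank-based quintiles.

    Ties are broken by input order, which is a simplification.
    """
    n = len(values)
    # Sort indices so that the "worst" customer comes first.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    scores = [0] * n
    for rank, idx in enumerate(order):
        scores[idx] = 1 + (rank * 5) // n
    return scores


def rfm_cells(customers):
    """Return an (R, F, M) score triple for each customer.

    `customers` is a list of dicts with hypothetical keys
    'days_since_purchase', 'n_purchases', and 'total_spent'.
    """
    # A recent purchase (few days since purchase) is better, so lower is better.
    r = quintile_scores([c["days_since_purchase"] for c in customers],
                        higher_is_better=False)
    f = quintile_scores([c["n_purchases"] for c in customers])
    m = quintile_scores([c["total_spent"] for c in customers])
    return list(zip(r, f, m))


# Number of distinct RFM groups: 5 levels on each of 3 dimensions.
N_RFM_GROUPS = 5 ** 3
```

A marketer following the approach in the text would then target the groups with high scores on all three dimensions, for example the cell (5, 5, 5).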
Both marketing [4][29] and data mining researchers [1][6][20][35] have presented various database marketing approaches for successful CRM programs. Several studies presented models to target households by using the knowledge extracted from the customer's purchase history [11][32]. In other studies [25][26], the profitability condition of a campaign was explicitly formulated as a function of the model performance along with campaign cost and revenue factors such as mailing costs and marginal revenue per identified positive record. In another study [16], a business intelligence model was presented to identify as many customers as possible who will respond to a specific solicitation campaign letter by studying the effects of variable selection and class distribution on the performance of both a primitive classifier system and a relatively more sophisticated classifier system. Other studies showed the effects of customer satisfaction and switching barriers on customer loyalty in Korea [15] and the relationship among customer retention, loyalty, and satisfaction in Germany [9], respectively.

A few models narrow their interests, such as selecting prospects in the automotive industry [10] and identifying the most likely insurance buyers [17]. In particular, several studies aim to provide accurate models for churn prediction in the telecommunication industry [3][7][37] based on various statistical and data mining models such as neural network models, support vector machines, or clustering algorithms, along with customers' demographics, billing, and call detail records. A recent discussion of the advantages and disadvantages of various models for churn management can be found in [14]. However, most of these models have focused not on identifying critical success factors but only on building an accurate churn prediction model, which may serve only as the first step in developing churn marketing programs.

In this paper, we present not only a prediction model to accurately predict customers' churning behavior but also a simple but implementable churn marketing program. Note that for successful initiation and management of churn marketing programs, both the predictive accuracy and the comprehensibility of a model become important. In particular, a rule-based system that consists of too many if-then statements cannot reveal the key drivers of consumer behaviors and hence greatly reduces managers' trust in the system itself. To address both issues simultaneously, we propose a Partial Least Square (PLS)-based methodology that allows a marketing manager to maintain an optimal (or at least a near optimal) level of churners effectively and efficiently through her marketing programs.
The detailed objectives are: (i) to build an accurate and concise predictive model for churner prediction based on PLS-based methodologies from vast data sets of highly correlated variables; (ii) to conceptually design an effective churn control model after understanding key drivers of consumer behaviors; and (iii) to develop a segmentation based marketing campaign strategy, to validate its effectiveness for a chosen campaign size, and to determine an optimal marketing campaign size that maximizes the campaign profit. In our approach, PLS is employed as the prediction modeling method because it places minimal demands on measurement scales, sample size, and residual distributions, and it is capable of handling a large number of highly correlated variables, measurement errors, and missing data [21][24][38]. Further, PLS models can naturally be used for dimension reduction through a variable selection mechanism based on the variable importance in projection (VIP) scores. Therefore, PLS models can be used not only for constructing a highly accurate model but also for enhancing the comprehensibility of models by choosing a subset of the original predictive variables. By doing so, marketing managers can save a great deal of effort and cost in identifying key determinants of churn behaviors of customers. However, eliminating many input variables may have different effects on the predictive accuracy of models depending on their representational powers and structural complexities. Therefore, this study aims to analyze the relationship between variable selection and the performance of several PLS models.

The remainder of this paper is organized as follows. Section 2 provides a review of PLS methodology. In Section 3, the research framework based on PLS models is introduced, and evaluation metrics to compare various predictive models on a data set collected by a mobile phone service provider are explained. In Section 4, experimental results of linear and nonlinear PLS models compared to Logit and random models are presented. Section 5 illustrates a scenario of the optimal churn marketing campaigns based on customer segmentations. Finally, Section 6 provides the conclusion of the paper and suggests several directions for further research.

2. LITERATURE REVIEW

Various statistical models such as logistic and ordinary regression models, canonical correlation, or structural equation modeling (SEM) have been used to investigate relationships between independent (predictor) and dependent (response) variables. Note that while the ordinary regression method models a continuous dependent variable as a linear combination of continuous predictor variables, the logistic regression method (or logit model) models the log odds of a binary dependent variable as a linear combination of a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. In particular, binomial logistic regression has been frequently used to estimate the odds of a certain event occurring based on the maximum likelihood estimation (MLE) technique.
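As a brief illustration of the log-odds formulation just described, the logit model can be written as follows. The symbols are assumptions introduced here rather than the paper's notation: p is the probability that a customer churns, x_1, ..., x_k are the predictors, and the beta coefficients are estimated by MLE.

```latex
\log\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j ,
\qquad\text{equivalently}\qquad
p = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_j\right)\right)} .
```

Under this form, a one-unit increase in x_j multiplies the odds p/(1-p) by exp(beta_j), which is why the model is convenient for estimating the odds of an event such as churn.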
In addition, logistic regression does not require normally distributed variables and does not assume homoscedasticity, and it has been widely adopted by marketing researchers to study brand choice behaviors of customers [12][13]. While the logistic regression model has been successful in several marketing research areas, a multivariate projection approach is studied for predicting churning behaviors in this study. The PLS method is one such method, and it can also be used to reduce the original large-scale data to lower dimensional data to deal with highly (both linearly and nonlinearly) correlated data between independent variables and dependent variables [8][19][22]. Therefore, one of the main objectives in PLS analysis is to find a few PLS factors that explain most of the variation in both independent and dependent variables. The PLS factors that explain most of the variation in responses using observed information of the predictors can constitute good predictive models for new responses. Fundamentally, the PLS method can be used as an alternative to well known models such as logistic or ordinary linear regression.

One of the most popular application areas of the PLS method is process optimization. In such an application, a PLS model is applied to extract latent variables as a linear or nonlinear combination of process variables and to determine the optimal settings of the process variables from the latent variable spaces [5]. In another recent study [38], the PLS method is compared with a well known data mining algorithm, random forest, to identify the key variables that cause much variation in bulk vaccine yield so that vaccine researchers can optimize new control processes to reduce the variation. In another study [18], PLS is utilized with some success in building a predictive system to assess failure probability in small to medium-sized Finnish firms using financial and non-financial variables and reorganization plan information.

Note that the PLS method can model both multiple response and multiple predictor variables, even when multicollinearity among predictors is suspected. Further, the PLS method is known to be robust when there are many observations with missing values in the data. However, it is difficult for the researcher to interpret loadings of the independent latent variables from the PLS method, and hence it is favored as a predictive technique and not as an interpretive technique. The PLS method can be implemented either as a regression model to predict response variables or as a path model to understand the structural relationship among records. In this study, the PLS method is used as a regression model to predict churners based on demographic, psychographic, and historical service usage information. We also note that in prediction tasks, the number of latent variables is an important factor that affects the predictive accuracy.
The number of latent variables is usually chosen by cross validation, considering the proportion of variation explained by each latent variable. The variable importance in projection (VIP) has been proposed and used as a measure of the importance of each variable in contributing to the response of interest. After the PLS model is fitted, the VIP of the j-th variable is calculated as the sum of the weights of the j-th variable on the latent variables divided by the total weights. Since the average of squared VIP over all variables is 1, the "greater than one" rule is generally used as a criterion for selecting significant variables.

In nonlinear PLS models, the nonlinear functional relationship can be constructed by algorithms using neural networks or a Gaussian kernel. Qin and McAvoy [27] propose neural net PLS to approximate the nonlinear mapping between the input and output score variables, and show better prediction than PLS and a direct neural network. Malthouse et al. [22] implement nonlinear PLS with a feed forward neural network and apply it to a model with multiple response variables. Nonlinear PLS based on neural network mapping requires considerable computational time, which limits its application to large data sets. Rosipal and Trejo [30] proposed a kernel based partial least squares regression method. Kernel methods are learning methods that find nonlinear patterns using linear models. By moving input data to a higher dimensional space using mapping functions, linear patterns can be found in the expanded space that represent nonlinear patterns in the original feature space [33]. It is also possible to detect nonlinear patterns in the original input space without an explicit definition of the mapping functions by using a kernel function, which represents an inner product of two vectors in the expanded space. For this study, we used a Gaussian kernel function for the nonlinear PLS, which works especially well when the dimension of the original input space is so low that expanding the input space into a higher dimensional abstract space provides much more information than before.

3. RESEARCH MODEL AND DATA SET

3.1 Research model

Our research framework consists of four sequential steps based on typical CRM processes, as illustrated in Figure 1. As the first step, it is necessary to preprocess raw data into a format readily available for further analysis. In this study, two different techniques, eliminating records with missing values and variable selection, are used separately or together for preprocessing raw data. Once preprocessed data sets are obtained from raw data, three different types of classifiers (Logit regression, linear PLS, and nonlinear PLS models) are calibrated and evaluated. In this process, the performance of a random model is also evaluated to highlight the additional gains in predictive power from using any one of the three intelligent models.
As one of the intelligent model types, several linear PLS models will be calibrated and evaluated in terms of comprehensibility, computational complexity, and predictive power related metrics such as hit ratio. Then a nonlinear PLS model and Logit regression models are constructed and compared with the most accurate linear PLS model. Multiple Logit regression models with different pre-defined significance values will be considered to estimate the effectiveness of dimension reduction via stepwise variable selection and via the PLS method. We also intend to compare the linear PLS model to the nonlinear model to see if a nonlinear mapping between predictors and response variables can boost the predictive power of the linear PLS models. Finally, managerial insights extracted from experimental results are discussed to help data analysts and marketing managers develop a successful churn management program by improving service quality and developing management strategies through hardware replacement, complaint management, and service quality improvement.

Figure 1 Research framework

3.2 Data set

The data sets used in this study are provided by the Teradata Center for CRM at Duke University, and the original data set for calibration has 171 predictor variables and 100,000 observations. The complete set of variables includes three types of variables: behavioral information such as minutes of use, revenue, and handset equipment; company interaction information such as customer calls into the customer service center; and customer household demographics. For each customer, churn was calculated based on whether the customer left the company during the 31- to 60-day period after the customer was originally sampled. Although the actual percentage of customers who left the company in a given month is approximately 1.8%, churners in the original data set were oversampled to create roughly a split between churners and nonchurners. However, the test data set with 51,306 observations is expected to represent a realistic churning rate, 1.8%.

As a preprocessing step for further analysis, the original data sets are preprocessed as follows. First, most categorical variables are excluded, either because of a high missing rate or because they would be encoded into multiple binary variables with low predictive power. We include only 11 categorical variables, which are either indicator variables or countable variables such as the number of handsets and the number of subscribers. This is because each categorical variable has very little predictive power in general [31]. Second, continuous variables with more than 20% missing values are eliminated. We thus take 123 predictors, including 11 categorical variables and 112 continuous variables, in the data preprocessing step. Finally, records with missing values in the data set with 123 predictors are removed from further analysis. After the preprocessing steps, the training set contains 67,181 observations with 32,862 churners, while the test set contains 34,986 observations with 619 churners.

3.3 Evaluation metrics

In this study, the hit rate and the lift trend curve are used to numerically quantify the predictive power of models and to graphically represent their performance for easy comparison, respectively. The hit rate is defined in this study as the proportion of correctly identified churners among the selected churner candidates. When only the x% of customers predicted most likely to churn are considered for the model evaluation, it is called the hit rate at target point x%. For example, if the model is required to select the 1,000 customers who are most likely to churn from 10,000 observations, and 200 of them turn out to be actual churners, then the hit rate at target point 10% (1,000/10,000 = 10%) is 20% (200/1,000 = 20%). The lift trend curve shows the lift at a target point x%, where the lift is the ratio of the hit rate of a predictive model to the hit rate of a random model. This paper uses the raw number of correctly identified churners at small target points because of the limited budget and time available for developing marketing programs.

4. EXPERIMENTAL RESULTS

4.1 Stepwise variable selection vs. PLS variable selection

PLS models can be used to reduce the data dimension using the variable importance in projection (VIP) score of each variable. In Table 1, we present the ranked list of only those variables whose VIP score is higher than or equal to 1.0. According to Wold [39], the researcher may safely remove any independent variable which has a small VIP (< 0.8) and a small absolute value of regression coefficient. As an exploratory study, we use a set of VIP cut-off values (0, 1.0, 1.2, and 1.5) in this study. We denote by PLS 1.0 a PLS model with all variables whose VIP scores are greater than or equal to 1.0, while the PLS all model utilizes all variables (i.e., PLS 0).
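The VIP-based selection with the cut-off values above can be sketched as follows. The paper describes VIP only verbally, so this sketch assumes the commonly cited formulation VIP_j = sqrt(p * sum_a SS_a (w_ja / ||w_a||)^2 / sum_a SS_a), where p is the number of predictors, w_ja is the weight of predictor j on latent variable a, and SS_a is the response variation explained by latent variable a. The function and argument names are hypothetical, and the weights are taken as given inputs rather than fitted from data.

```python
import math


def vip_scores(weights, explained_ss):
    """Compute VIP scores under the assumed standard formulation.

    weights[j][a]   : weight of predictor j on latent variable a
    explained_ss[a] : response variation explained by latent variable a
    """
    p = len(weights)
    n_comp = len(explained_ss)
    # Squared norm of each latent variable's weight vector.
    norms_sq = [sum(weights[j][a] ** 2 for j in range(p)) for a in range(n_comp)]
    total_ss = sum(explained_ss)
    scores = []
    for j in range(p):
        acc = sum(explained_ss[a] * weights[j][a] ** 2 / norms_sq[a]
                  for a in range(n_comp))
        scores.append(math.sqrt(p * acc / total_ss))
    return scores


def select_by_vip(names, scores, cutoff):
    """Return the variable names with VIP >= cutoff, highest VIP first."""
    ranked = sorted(zip(names, scores), key=lambda pair: -pair[1])
    return [name for name, score in ranked if score >= cutoff]
```

With this formulation the mean of the squared VIP scores equals 1, which matches the rationale for the "greater than one" rule. A cut-off of 0 keeps every variable (the PLS all model), while larger cut-offs such as 1.0, 1.2, or 1.5 give progressively smaller variable subsets.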
The PLS 1.5 model is therefore the most parsimonious PLS model, while the PLS all model is the most comprehensive model. We also indicate in Table 1 whether each of these variables positively or negatively affects the churning decision of service users. For example, the "Number of days (age) of current equipment" variable is ranked first because of its highest VIP score (2.995), and it is positively associated with the churning decision of service users, meaning that a service user who has kept her current mobile equipment longer is more likely to switch to another service provider. It is also noted that a variable indicating whether a subscriber's phone was refurbished or new (Handset: refurbished or new) is considered the second most important variable, although it is negatively associated with the churning decision of service users. Due to the limited space, we focus on the first nine variables in Table 1, which constitute the PLS 1.5 model because their VIP scores are higher than or equal to 1.5, and leave the interpretation of the remaining variables to the reader.

Table 1 Variables with VIP scores based on PLS 1.0 model

Rank | Variable Description | VIP Score | Churn Impact
1 | Number of days (age) of current equipment | |
2 | Handset: refurbished or new | |
3 | Age of first household member | |
4 | Average monthly minutes of use over the life of the customer | |
5 | Range of revenue of voice overage | |
6 | Range of revenue of overage | |
7 | Percentage change in monthly minutes of use vs. previous three month average | |
8 | Range of overage minutes of use | |
9 | Average monthly number of calls over the life of the customer | |
10 | Mean revenue of voice overage | |
11 | Mean overage revenue | |
12 | Account spending limit | |
13 | Mean overage minutes of use | |
14 | Range of revenue (charge amount) | |
15 | Total number of months in service | |
16 | Mean total monthly recurring charge | |
17 | Number of handsets issued | |
18 | Range of number of minutes of use | |
19 | Range of number of attempted voice calls placed | |
20 | Range of number of completed voice calls | |
21 | Range of number of attempted calls | |
22 | Range of number of completed calls | |
23 | Range of number of inbound and outbound peak voice calls | |
24 | Mean monthly revenue (charge amount) | |
25 | Mean rounded minutes of use of customer care calls | |
26 | Number of models issued | |
27 | Range of unrounded minutes of use of peak voice calls | |
28 | Mean unrounded minutes of use of customer care (see CUSTCARE_MEAN) calls | |
29 | Average monthly revenue over the previous three months | |
30 | Average monthly revenue over the life of the customer | |
31 | Mean number of customer care calls | |
32 | Total number of calls over the life of the customer | |
33 | Mean number of monthly minutes of use | |
34 | Number of unique subscribers in the household | |
35 | Billing adjusted total number of calls over the life of the customer | |
36 | Mean number of dropped (failed) voice calls | |
37 | Average monthly number of calls over the previous six months | |
38 | Total minutes of use over the life of the customer | |
39 | Range of rounded minutes of use of customer care calls | |
40 | Range of number of received voice calls | |
41 | Billing adjusted total minutes of use over the life of the customer | |
42 | Average monthly minutes of use over the previous three months | |
43 | Range of unrounded minutes of use of customer care calls | |
44 | Range of total monthly recurring charge | |
45 | Range of unrounded minutes of use of completed voice calls | |
46 | Range of number of off-peak voice calls | |

The most interesting fact that we noticed is that the two most predictive variables are related to the mobile handset, and it is not very difficult to intuitively link these variables to churn behavior. For example, many mobile phone service subscribers in the USA are required to enroll in a two-year contract on the condition that certain handsets are given free. Most subscribers are not likely to churn to another service provider until the contract expires, in order to avoid a financial penalty. When the contract period ends soon and/or the utility (or benefit) from adopting a new handset with more services and better interfaces from another service provider outweighs the switching cost, a user is more likely to switch to a new service provider. Therefore, the number of days of current equipment is a strong indicator of whether the current subscriber may switch or stay. The second most important variable, Handset: refurbished or new, implies almost the same insight. Typically, a customer is most likely to purchase a new handset as part of a promotion package from the current service provider and hence stay with the current service provider until the end of a contract period. However, if a used or refurbished handset was purchased without any obligation, the customer is more likely to switch to a new service provider.

From a marketing manager's perspective, the aforementioned findings imply that marketers must pay attention to current users who have used the current handset for almost their entire contract period or whose contract periods end in the near future. Once users with a higher probability of churning are identified, marketing managers should immediately contact them to offer customized micromarketing messages before the contract period ends or the users switch to another service provider.
For example, about one month before the end of the contract period, the marketer may send an appreciation letter offering a new handset with more advanced functions at a reasonable price as long as the current user agrees to stay with the current service provider for another contract period.

Other variables describing a user's service usage patterns are also informative for predicting churn behaviors. For example, heavy service users with a higher range of overage revenue and number of minutes of use (5th, 6th, and 8th variables in Table 1), or a higher value of average monthly number of calls over the life of the customer (9th variable in Table 1), are most likely to churn unless they are extremely satisfied with the current service quality. In addition, when users do not use the mobile service as much as they have used it in the past three months (7th variable in Table 1), they are likely to switch. Further, users who have been through many billing adjustments over their lifetime (4th variable in Table 1) are most likely to churn, partly because this is a possible indicator of poor satisfaction with the current service. Interestingly, the age of the first household member affects the user's churning behavior negatively (i.e., older users tend to be more loyal to the current service provider than young users), indicating that young users are more responsive to new service plans associated with the cost, functionality, and design of mobile devices.

In order to measure the effectiveness of dimension reduction of the proposed PLS models, a simple variable selection procedure, stepwise variable selection, is performed. Once a set of informative variables is selected through forward selection, they are used as predictor variables of a logit regression model to estimate the likelihood of becoming churners. Note that the stepwise variable selection procedure starts with the empty set of variables and greedily adds the variables that most improve performance based on χ² scores. It stops adding new variables when no additional variables satisfy the predefined significance level (e.g., α = 0.05) for entry into the model. Two different significance levels are used (α = 0.05 and α = 0.15). From the perspective of model interpretation, a predictive model with a parsimonious variable subset (i.e., α is set to 0.05) is preferred because of improved comprehensibility, as long as the performance of the two models is comparable. Further, the choice of a simpler model is consistent with the well known Occam's razor principle, which states that, all other things being equal, a simpler model generalizes better and hence is preferred [2]. Due to the limited space, we only compare the first nine variables selected by the two models: the stepwise variable selection with α = 0.05 and the PLS 1.5 model-based variable selection (VIP ≥ 1.5).
These variables are listed in Table 2, ordered by selection order for stepwise variable selection and by VIP score for PLS variable selection. Variables in bold are selected by both methods.

Table 2 List of variables selected via stepwise variable selection and PLS

| Rank | Stepwise variable selection (α = 0.05) | PLS 1.5 model-based variable selection (VIP cut-off of 1.5) |
|---|---|---|
| 1 | **Number of days of current equipment** | **Number of days of current equipment** |
| 2 | **Handset (refurbished or new)** | **Handset (refurbished or new)** |
| 3 | **Range of overage minutes of use** | **Age of first household member** |
| 4 | Mean of unrounded minutes of use of completed voice calls | **Billing adjusted total minutes over the life of the customer** |
| 5 | **Age of first household member** | Range of revenue of voice overage |
| 6 | Total number of months in service | Range of revenue of overage |
| 7 | Account spending limit | Percentage change in monthly minutes of use vs. previous 3 month average |
| 8 | **Billing adjusted total minutes over the life of the customer** | **Range of overage minutes of use** |
| 9 | Mean of number of minutes of use | Average monthly number of calls over the life of the customer |
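The greedy forward selection described before Table 2 can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the `entry_score` callback is a hypothetical stand-in for whatever routine computes the χ² entry score of a candidate variable given the current logit model, and each candidate is assumed to contribute one degree of freedom.

```python
import math


def chi2_p_value_1df(statistic):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


def forward_select(candidates, entry_score, alpha=0.05):
    """Greedy forward selection with a significance level for entry.

    entry_score(selected, variable) is a hypothetical callback that returns the
    chi-square score for adding `variable` to the model built on `selected`.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        # Score every remaining variable given the variables chosen so far.
        scores = {v: entry_score(selected, v) for v in remaining}
        best = max(scores, key=scores.get)
        # Stop when even the best candidate fails the entry significance level.
        if chi2_p_value_1df(scores[best]) >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy scores (made up) that ignore the current model; a real scorer would
# recompute the score against the refitted logit model at every step.
toy_scores = {"equipment_days": 10.0, "handset_refurbished": 5.0, "spending_limit": 3.0}


def toy_entry_score(selected, variable):
    return toy_scores[variable]


strict = forward_select(toy_scores, toy_entry_score, alpha=0.05)
loose = forward_select(toy_scores, toy_entry_score, alpha=0.15)
print(strict, loose)
```

With these toy scores, the stricter level α = 0.05 stops one variable earlier than α = 0.15, which illustrates why the stricter setting yields the more parsimonious subset.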

We immediately notice that the two most important variables identified by the PLS 1.5 model are also chosen, in the same order, by stepwise variable selection with α = 0.05. In addition, five variables are selected by both methods. Overall, the selected variables reflect customers' usage behaviors and interactions with the company, which is in line with marketing science work [31].

4.2 Churner prediction with linear PLS Models

In this section, we present the prediction performance of linear PLS models in a prediction task to correctly identify churners. Figure 2 presents the hit rates of linear PLS models combined with four different cut-off values of VIP scores. Unfortunately for the goal of parsimony, Figure 2 shows that PLS models with more input variables are more predictive. Note, however, that all PLS models perform significantly better than a random model, whose proportion of hit records is always equal to the proportion of chosen records. Both the PLS all and PLS 1.0 models correctly identify more than 30% of churners at the 20% target point. In particular, the performance of the PLS 1.0 model is very comparable to that of the PLS all model when a small portion of records is chosen for a churn management program, and hence it could be used as an alternative to the PLS all model when both the costs and benefits of a churn management program are considered. Note that with more targets, a marketing manager needs to spend more money and time to distribute campaign messages and offer discounted service fees or free handsets.

Figure 2 Hit rates of linear PLS models

Figure 3 presents the relative performance of all linear PLS models compared to a random model. The x axis represents the proportion of chosen records, and the y axis represents the lift, defined as the hit rate of a given model divided by the hit rate of a random model at a chosen record proportion. Therefore, lift values higher than 1.0 indicate that the compared model performs better than a random model. We immediately note that, at the 5% selection point, both the PLS all and PLS 1.0 models identify twice as many hit records (i.e., correctly identified churners) as a random model, while the other PLS models, PLS 1.2 and PLS 1.5, identify about 60% more churners than a random model. Naturally, all PLS models show decreasing lift values as more users with lower estimated churn probability are added to the pool of churner candidates. However, we note that the parsimonious PLS models with higher VIP cut-off scores, PLS 1.2 and PLS 1.5, show relatively stable lift trends compared to the PLS models with lower VIP cut-off scores, PLS all and PLS 1.0, partially because of their relatively lower lift values. We also attribute this finding to the fact that parsimonious PLS models with fewer input variables typically generalize better in a prediction task.
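The hit rate and lift definitions above can be made concrete with a small sketch. It assumes, consistent with the description of the random model, that the hit rate is the share of all actual churners captured among the selected records; the function names and the toy scores and labels are illustrative only and are not taken from the study's data.

```python
def hit_rate(scores, labels, fraction):
    """Share of all churners captured in the top `fraction` of records by score."""
    n_selected = max(1, round(fraction * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_selected])
    return hits / sum(labels)


def lift(scores, labels, fraction):
    """Hit rate of the model divided by the hit rate of a random model."""
    n_selected = max(1, round(fraction * len(scores)))
    random_hit_rate = n_selected / len(scores)  # random model: hits equal the share chosen
    return hit_rate(scores, labels, fraction) / random_hit_rate


# Toy example: ten customers, four actual churners (label 1).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

print(hit_rate(toy_scores, toy_labels, 0.2))  # two of four churners in the top 20%
print(lift(toy_scores, toy_labels, 0.2))
```

When every record is selected, the hit rate and the lift both equal 1, which matches the convergence of all lift curves toward the random model as the selection point grows.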

Figure 3 Lift trends of linear PLS models

4.3 Comparative analysis for churner prediction

We also compare the prediction performance of PLS models to other popular classification models in the marketing community: two Logit models, an artificial neural network (ANN) model, a decision tree (DT) classifier, a Naïve Bayes (NB) model, and a random model. For PLS models, we choose the linear PLS model, PLS all, that shows the best performance among all linear PLS models in Figure 2. We also implement a nonlinear PLS model, PLS kernel, using a Gaussian kernel function, \(K(x_i, x_j) = \exp(-\lVert x_i - x_j \rVert^2 / k)\), where \(x_i\) and \(x_j\) represent the i-th and j-th records of X, and k is the parameter of the Gaussian kernel function. To lessen the computational load, we compute the values of the kernel function from 6,000 samples with an even class distribution drawn from the training data, after standardizing the observations of each variable; the value of k is subjectively set to 200.

For comparison purposes, two Logit models are implemented: Logit 0.15 with α = 0.15 and Logit 0.05 with α = 0.05. Logit 0.05 is the more parsimonious model because, with a stricter significance level, the stepwise variable selection method stops adding new variables earlier. Although the two models are very comparable, the Logit 0.15 model was slightly better in terms of prediction power and hence was chosen for the prediction performance comparison. Note, however, that we prefer the Logit 0.05 model to the Logit 0.15 model for interpretation because it helps marketing managers better understand churn behaviors from fewer variables and hence develop new micromarketing programs easily.

To implement the ANN, DT, and NB classifiers, we use a popular data mining package, Weka, which is publicly available. Note that while all these algorithms have been tested on various data sets in almost all business applications [34][40], they require a different number and kind of parameters, and their performance and computational complexity depend on a set of parameter values optimized for each data set. For example, an ANN model is known to be very accurate and robust to noise in data sets [23], but its prediction performance is strongly dependent on various parameters such as the network architecture (e.g., a three-layered back-propagation network), the transfer function (e.g., a sigmoid transformation), the number of hidden nodes, the learning rate, the momentum rate, the weights and biases on each connection among neurons, and the number of learning epochs. Nevertheless, we implement a standard three-layered ANN classifier with back-propagation learning using all default parameter values pre-determined by Weka for easy replication of our experimental results. Another popular classifier, DT, provides more intuitive outputs in the form of a list of if-then rules and is relatively free from many parameters compared with ANNs. In our implementation, we use a pruned C4.5 tree [28] based on an entropy-based gain ratio criterion to split records, with all default parameter values. The Naïve Bayes model is one of the simplest models: it requires no parameters and uses Bayes' theorem to estimate the conditional probability of response values. We use the same hit rate and corresponding lift trend curve to compare the performance of the prediction models, and graphically summarize their performances in Figures 4 and 5.
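As an illustration of the Gaussian kernel used for the PLS kernel model in this section, the following minimal sketch standardizes each variable and then evaluates the kernel with k = 200. The helper names and the toy records are assumptions made for illustration; the kernel PLS fitting step itself is not shown.

```python
import math


def standardize(records):
    """Scale each variable (column) to zero mean and unit standard deviation."""
    columns = list(zip(*records))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in records]


def gaussian_kernel(x_i, x_j, k=200.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / k)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / k)


def kernel_matrix(records, k=200.0):
    """Pairwise kernel values for all records (symmetric, ones on the diagonal)."""
    return [[gaussian_kernel(a, b, k) for b in records] for a in records]


# Toy records with two variables each (made-up values).
toy_records = [[1.0, 200.0], [2.0, 180.0], [3.0, 260.0], [4.0, 240.0]]
K = kernel_matrix(standardize(toy_records))
print(K[0][1])
```

Because the kernel matrix grows quadratically with the number of records, evaluating it on a subsample, as done above with 6,000 training records, keeps the computation manageable.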

Figure 4 Hit rates of single classifiers

Figure 5 Lift trends of single classifier models compared to a random model

First of all, Figure 4 presents the hit rates of six predictive models: a linear PLS classifier (PLS all), a nonlinear PLS classifier (PLS kernel), Logit 0.15, a DT, an ANN, and a random model. We note that while both the PLS all and PLS kernel models perform best in general, PLS kernel performs slightly better than PLS all, possibly because a nonlinear PLS model is better suited to fit a nonlinear relationship between the churn indicator and the other variables. In particular, PLS kernel performs relatively better at the 5% and 10% target points, returning higher lift values than all the other models. Surprisingly, the DT and ANN models show disappointing performance while the simple Logit 0.15 model performs reasonably well. In particular, the DT model with default parameter settings results in a very large tree structure with 8,839 leaves (the total size of the tree was 17,677) whose performance is not significantly different from that of a random model. Considering that its tree structure consists of too many if-then rules, we attribute its low performance to the default settings of tree parameters or to possible overfitting, in which a calibrated model fits the training data set too well but predicts poorly on a new test data set. The performance of the NB model is not distinguishable from that of the DT model, and hence is not shown in Figures 4 and 5. Since no parameter tuning is required for an NB model, we attribute its poor performance to the fact that its probability estimates can be biased by the unrealistic assumption that input variables are conditionally independent of each other given the value of the class indicator.

Figure 5 shows the lift trends of the classifier models compared to a random model. All classifiers except the DT model (PLS all, PLS kernel, Logit 0.15, and the ANN model) show lift values that are highest at the 5% selection point and then gradually decrease as more prospects are targeted.
The lift values of the DT model at the 5% and 10% selection points are lower than 1, indicating that its performance there is worse than that of a random model, while at other selection points its performance is better than that of a random model. One thing to note is that classifiers can be compared on various indices, such as predictive accuracy, computational complexity in terms of memory and CPU time requirements, and ease of interpretation of the resulting models. For example, in our experiment, the ANN model performs significantly better than a random model and the DT model in terms of predictive accuracy, but it takes much longer to calibrate (more than five hours for the ANN vs. seconds for the DT), and its final neural network structure is often not immediately understandable due to its black-box nature. Note also that it is possible to boost the performance of all classifier models, including not only the ANN and DT models but also the PLS all and PLS kernel models, by tuning their parameter values on the currently used data set. In particular, an ensemble model that combines multiple single classifiers may significantly boost the predictive performance. Since such a comprehensive comparative analysis requires testing numerous models calibrated with different parameter values and ensemble construction strategies, we leave this important issue as a possible future research direction.

5. SEGMENTATION BASED CHURN MARKETING CAMPAIGNS

In this section, we propose a simple marketing campaign that combines our PLS-based churner identification model with a standard clustering analysis to demonstrate the applicability of the proposed approach in a real-world situation and to recommend the campaign size at which the profit from the marketing campaign is maximized. The overall procedure for the marketing campaign is as follows. For a chosen campaign size (e.g., the top 5% of customers based on their churn probability), we first segment customers into five different groups using the K-Means algorithm, one of the most well-known clustering methods. Note that both the number of clusters and the properties of the variables used for clustering are the main factors that determine the outcomes of clustering analysis. Although a method to identify the optimal number of clusters has yet to be developed, theoretically, the number of clusters ranges between one and the number of data points. In this study, we subjectively form five clusters for easy demonstration of our approach. In terms of variables for clustering analysis, we note that the comprehensibility and applicability of marketing programs are important for marketing managers. Therefore, we first identify a set of variables that are known to be the most significant factors in explaining customer churn behaviors based on their highest VIP scores. From these, we select seven variables necessary for the development and implementation of marketing strategies. The seven variables and the marketing programs are discussed in detail later.
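The targeting and segmentation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the churn probabilities and the seven clustering variables are random placeholders, and the small `k_means` routine is a plain Lloyd-style implementation written for this sketch rather than the clustering software used in the study.

```python
import random


def top_fraction(churn_prob, fraction):
    """Indices of the top `fraction` of customers ranked by churn probability."""
    n_selected = max(1, round(fraction * len(churn_prob)))
    order = sorted(range(len(churn_prob)), key=lambda i: churn_prob[i], reverse=True)
    return order[:n_selected]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k=5, n_iter=100, seed=0):
    """Assign each point to one of k clusters; returns (assignment, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Placeholder data: 200 customers, 7 clustering variables (random values).
rng = random.Random(1)
churn_prob = [rng.random() for _ in range(200)]
features = [[rng.gauss(0.0, 1.0) for _ in range(7)] for _ in range(200)]

targets = top_fraction(churn_prob, 0.05)  # campaign size: top 5%
segment, centroids = k_means([features[i] for i in targets], k=5)
print(len(targets), sorted(set(segment)))
```

Repeating the last two steps for each campaign size (5%, 10%, and so on) gives one set of five segments per size, to which the marketing strategies can then be matched.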
Once five clusters of customers are formed, we analyze the characteristics of each cluster to determine which marketing strategy is most suitable, and apply only the best-fit marketing strategy to each cluster. Then, we compute revenue and cost to verify the effectiveness of the marketing campaign for each cluster. By repeating the aforementioned steps over a set of campaign sizes (e.g., the top 5%, 10%, 15%, 20%, 25%, and 30% of customers based on their churn probabilities), we also recommend the best campaign size in terms of estimated marketing profits.

5.1 Marketing program development

Our proposed marketing program consists of three marketing strategies: device management strategy (DMS), overage management strategy (OMS), and complaints management strategy (CMS). The main purpose of DMS is to provide incentives to renew the contract for customers who are most likely to churn when their contract period is close to expiration, while OMS focuses on soliciting current overage users for premium services at marginal increases in service fees. Finally, CMS aims to appeal to customers who have experienced technical difficulties and accounting discrepancies by providing highly responsive services and financial incentives. Each marketing strategy is based on one or two of the seven variables used for customer clustering. The relationships between these strategies and cluster characteristics are detailed in Section 5.2.

The core element of DMS is to provide free mobile devices to customers whose service contract period will expire soon, and thus to have them extend their service contract for two more years. The estimated revenue from each true positive (TP) customer is then his or her monthly service fee (RevMean) for 24 months, where a TP customer is an actual churner among the customers who are predicted as churners in the original data set. For all analyses from now on, we use a TP customer as a surrogate for a customer who is predicted to accept the DMS offer (or other marketing offers such as OMS or CMS) and who actually accepts it. Further, we assume that all TP customers accept the marketing offer and stay with the current service provider. Although we acknowledge that this is a very strong assumption, we believe that the general framework for demonstrating the comparative effectiveness of marketing campaigns is still valid. When we apply the same DMS to all the customers in the same cluster, the total estimated revenue is the sum of the estimated revenue from each TP customer. In terms of the marketing cost of DMS, we consider $5 as the handling cost for each customer and $100 as the new device cost for each TP customer. Note that the handling cost includes postage, a customer care call ($2 per call), and any other small costs incurred to promote the strategy.
We summarize the formulas to compute the revenue and cost of DMS as follows:

\[ \text{Revenue (DMS)} = \sum_{i=1}^{n} \left( \text{RevMean}_i \times TP_i \times 24\ \text{months} \right) \tag{1} \]

\[ \text{Cost (DMS)} = \sum_{i=1}^{n} \left( \text{Handling cost} + \text{Device cost} \times TP_i \right) \tag{2} \]

where \(TP_i\) is a Boolean indicator of whether customer i belongs to the TP group and n is the number of customers in a chosen cluster considered for a specific marketing strategy (here, DMS).

The main purpose of OMS is to solicit current overage users, who frequently use their voice and data plans beyond their allowance, for premium services at marginal increases in service fees. Heavy users may benefit from this OMS offer because they enjoy upgraded services at only a marginal cost increase and avoid unexpected and expensive overage charges. The service providers also benefit from OMS because they can turn uncertain cash flow from monthly overage charges into steady and predictable cash flow. To compute the revenue from OMS for each TP customer who upgrades his/her service plan, we assume that each TP customer has been enrolled in the current service for 12 months on average (i.e., 12 more months remain until the end of the current service contract). Then, the revenue from OMS for each TP customer is the sum of his/her current service fee (RevMean) and the additional fee to upgrade his/her current service to the next available premium service, for the coming 12 months. The cost components of OMS include the handling cost of $5 for all customers and the promotion cost of waiving upgrade costs for the first three months for each TP customer as a token of gratitude. We summarize the formulas to compute the revenue and cost of OMS as follows:

\[ \text{Revenue (OMS)} = \sum_{i=1}^{n} \left( \text{RevMean}_i + \text{Upgrade cost}_i \right) \times TP_i \times 12\ \text{months} \tag{3} \]

\[ \text{Cost (OMS)} = \sum_{i=1}^{n} \left( \text{Handling cost} + \text{Upgrade cost}_i \times TP_i \times 3\ \text{months} \right) \tag{4} \]

where \(\text{Upgrade cost}_i\) represents the upgrade cost for customer i from the current service plan to the next available premium service, while \(TP_i\) is a Boolean indicator of whether customer i belongs to the TP group.

The last strategy, CMS, targets customers who are not satisfied with the services and have made many complaint calls to the service centers, causing high operating costs (typically $2 per call). Therefore, the objective of CMS is to make these customers satisfied with the customer care service so that they make fewer complaint calls and stay with the current service plan for the remaining contract period. The core part of CMS implementation is to credit 1% of the monthly service fee to each customer who made multiple complaint calls, after the complaints are verified as legitimate by customer care operators.
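As a worked illustration of the DMS and OMS calculations in Equations (1) to (4), the sketch below computes revenue, cost, and their difference for a small cluster. The $5 handling cost and $100 device cost come from the text; the function names, monthly fees, upgrade costs, and TP indicators are made-up values used only for illustration.

```python
HANDLING_COST = 5.0   # per customer contacted (from the text)
DEVICE_COST = 100.0   # per TP customer under DMS (from the text)


def dms_revenue_cost(rev_mean, tp, months=24):
    """Equations (1) and (2): revenue and cost of the device management strategy."""
    revenue = sum(r * t * months for r, t in zip(rev_mean, tp))
    cost = sum(HANDLING_COST + DEVICE_COST * t for t in tp)
    return revenue, cost


def oms_revenue_cost(rev_mean, upgrade_cost, tp, months=12, waived_months=3):
    """Equations (3) and (4): revenue and cost of the overage management strategy."""
    revenue = sum((r + u) * t * months for r, u, t in zip(rev_mean, upgrade_cost, tp))
    cost = sum(HANDLING_COST + u * t * waived_months for u, t in zip(upgrade_cost, tp))
    return revenue, cost


# Toy cluster of three customers; tp[i] = 1 marks a true positive customer.
rev_mean = [50, 60, 40]       # monthly service fees (illustrative)
upgrade_cost = [10, 10, 20]   # monthly upgrade fees (illustrative)
tp = [1, 0, 1]

dms_revenue, dms_cost = dms_revenue_cost(rev_mean, tp)
oms_revenue, oms_cost = oms_revenue_cost(rev_mean, upgrade_cost, tp)
print("DMS profit:", dms_revenue - dms_cost)
print("OMS profit:", oms_revenue - oms_cost)
```

In this toy cluster, the handling cost is paid for all three customers, while revenue, device cost, and waived upgrade fees accrue only for the two TP customers.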
The estimated revenue is the sum of the monthly service charges (RevMean) from all the TP customers over the remaining service contract period of 12 months on average. The estimated cost is the sum of $5 as the handling cost for all customers and 1% of the monthly service charge of each TP customer for 12 months. The profit of this strategy can be calculated as follows:

\[ \text{Revenue (CMS)} = \sum_{i=1}^{n} \left( \text{RevMean}_i \times TP_i \times 12\ \text{months} \right) \tag{5} \]