Survival Data Mining Gordon S. Linoff Founder Data Miners, Inc. gordon@data-miners.com
What to Expect from this Talk Background on survival analysis from a data miner s perspective Introduction to key ideas in survival analysis hazards survival competing risks Lots of examples Stratification Quantifying Loyalty Effort Voluntary and Involuntary Churn Forecasting Time to Reactivation and Re-purchase http://www.data-miners.com 2
Who Am I? I am not a statistician Adept with databases and advanced algorithms Founded Data Miners with Michael Berry in 1998 We have written three books on data mining Have become very interested in survival analysis for mining customer data survival data mining http://www.data-miners.com 3
What Does Data Mining Really Do? Provides ways to quantitatively measure what business users know or should know qualitatively Connects data to business practices Used to understand customers Occasionally, produces interesting predictive models http://www.data-miners.com 4
Data Mining Is About Customers Customer Relationship Lifetime Prospect New Customer Established Customer Former Customer Events Initial Purchase Sign-Up 2nd Purchase Usage Failure to Pay Voluntary Churn Customers Evolve Over Time http://www.data-miners.com 5
The State of the Customer Relationship Changes Over Time Starts P1 Starts P1 P2 P1 P1 P2 STOP P1 P2 P1 P3 Starts P1 P3 P1 http://www.data-miners.com 6 STOP
Traditional Approach to Data Mining Uses Predictive Modeling Jan Feb Mar Apr May Jun Jul Aug Sep Model Set 4 3 2 1 +1 http://www.data-miners.com 7 Score Set 4 3 2 1 +1 Data to build the model comes from the past Predictions for some fixed period in the future Present when new data is scored Models built with decision trees, neural networks, logistic regression, regression, and so on
Survival Data Mining Adds the Element of When Things Happen Time-to-event analysis Terminology comes from the medical world which patients survive a treatment, which patients do not Can measure effects of variables (initial covariates or time-dependent covariates) on survival time Natural for understanding customers Can be used to quantify marketing efforts http://www.data-miners.com 8
Example Results (Made Up) 99.9% of Life bulbs will last 2000 hours Mean-time-to-failure for hard disk is 500,000 hours A recent study published in the American Journal of Public Health found that life expectancy among smokers who quit at age 35 exceeded that of continuing smokers by 6.9 to 8.5 years for men and 6.1 to 7.7 years for women. [www.med.upenn.edu/tturc/pdf/benefits.pdf ] Example of initial covariate Stopping smoking before age 50 increases lifespan by one year for every decade before 50 http://www.data-miners.com 9
Original Statistics Life table methods used by actuaries for a long, long time These the are the methods we will be focused on Applied with vigor to medicine in the mid-20th Century Applied with vigor to manufacturing during 20th Century as well Took off with Sir David Cox s Proportion Hazards Models in 1972 that provide effects of initial covariates http://www.data-miners.com 10
Survival for Marketing Has Some Differences We are happy with discrete time (probability vs rate) Traditional survival analysis uses continuous time Marketers have hundreds of thousands or millions of examples Traditional survival analysis might be done on dozens or hundreds of participants We have the benefit of a wealth of data Traditional survival analysis looks at factors incorporated into the study We have to deal with window effects due to business practices and database reality Traditional survival ignores left truncation http://www.data-miners.com 11
To Understand the Calculation, Let s Focus on the End of the Relationship START tenure is the duration of the relationship STOP Challenge defining beginning of customer relationship Challenge defining end of customer relationship Challenge finding either in customer databases http://www.data-miners.com 12
How Long Will A Customer Survive? Percent Survived 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 0 12 24 36 48 60 72 Tenure (months) 84 High End Regular 96 108 120 Survival always starts at 100% and declines over time If everyone in the model set stopped, then survival goes to 0; otherwise it is always greater than 0 Survival is useful for comparing different groups http://www.data-miners.com 13
Key Idea: Data is Censored Known tenure when customers stop Known Customer Start time http://www.data-miners.com 14 Unknown tenure for active customers (censored)
Use Censored Data to Calculate Hazard Probabilities Hazard, h(t), at time t is the probability that a customer who has survived to time t will not survive to time t+1 h(t) = # customers who stop at exactly time t # customers at risk of stopping at time t Value of hazard depends on units of time days, weeks, months, years Differs from traditional definition because time is discrete hazard probability not hazard rate http://www.data-miners.com 15
Bathtub Hazard (Risk of Dying by Age) 3.0% 2.5% 2.0% 1.5% 1.0% 0.5% 0.0% http://www.data-miners.com 16 Bathtub hazard starts high, goes down, and then increases again Example from US mortality tables shows probability of dying at a given age 0-1 yrs 1-4 yrs 5-9 yrs 10-14 yrs 15-19 yrs 20-24 yrs 25-29 yrs 30-34 yrs 35-39 yrs 40-44 yrs 45-49 yrs 50-54 yrs 55-59 yrs 60-64 yrs Haz ard 65-69 yrs 70-74 yrs Age (Years)
Hazards Are Like An X-Ray Into Customers Hazard (Daily Risk of Churn) 0.30% 0.25% 0.20% 0.15% 0.10% 0.05% 0.00% Peak here is for nonpayment and end of initial promotion 0 60 120 180 240 300 360 420 480 540 600 660 720 Tenure (Days) Bumpiness is due to within week variation http://www.data-miners.com 17
Hazards Can Show Interesting Features of the Customer Lifecycle Daily Churn Hazard 8% 7% 6% 5% 4% 3% 2% Peak here is for nonpayment after one year Peak here is for nonpayment after two years 1% 0% 0 30 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570 600 630 660 690 720 750 780 810 840 870 900 930 960 990 Tenure (Days After Start) http://www.data-miners.com 18
How Long Will A Customer Survive? Survival analysis answers the question by rephrasing it slightly: What proportion of customers survive to time t? Survival at time t, S(t), is the probability that a customer will survive exactly to time t Calculation: S(t) = S(t 1) * (1 h(t 1)) S(0) = 100% Given a set of hazards by time, survival can be easily calculated in SAS or a spreadsheet http://www.data-miners.com 19
Survival is Similar to Retention Retention/Survival 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Retention is jagged. It is calculated on customers who started exactly w weeks ago. 0 10 20 30 40 50 60 70 80 90 100 110 Tenure (Weeks) Survival is smooth. It is calculated using customers who started up to w weeks ago. http://www.data-miners.com 20
Why Survival is Useful Compare hazards for different groups of customers Calculate median time to event How long does it take for half the customers to leave? Calculate truncated mean tenure What is the average customer tenure for the first year after starting? For the first two years? Can be used to quantify effects http://www.data-miners.com 21
Median Lifetime is the Tenure Where Survival = 50% 100% 90% Percent Survived 80% 70% 60% 50% 40% 30% 20% 10% 0% High End Regular 0 12 24 36 48 60 72 Tenure (months) 84 96 108 120 http://www.data-miners.com 22
Average Tenure Is The Area Under The Curves 100% 90% 80% Percent Survived 70% 60% 50% 40% 30% 20% 10% average 10-year tenure regular customers = 44 months (3.7 years) High End Regular average 10-year tenure high end customers = 73 months (6.1 years) 0% 0 12 24 36 48 60 Tenure 72 84 96 108 120 http://www.data-miners.com 23
Survival to Quantify Marketing Efforts A company has a loyalty marketing effort designed to keep customers This effort costs money What is the value of the effort as measured in increased customer lifetimes? SOLUTION: survival analysis How much longer to customers survive after accepting the offer? How to quantify this in dollars and cents? http://www.data-miners.com 24
Loyalty Offers is an Example of a Time- Dependent Covariate Non-Responders have no loyalty date Responders have a loyalty date Open Circle indicates that customer has stopped before (or on) the analysis date time Time = 0 Time = n Time c http://www.data-miners.com 25
We Can Use Area to Quantify Results 100% Percent Retained 95% 90% 85% 80% 75% 70% How much is this difference worth? 65% Group 1 Group 2 0 2 4 6 8 10 12 14 16 18 Months After Intervention Chart shows survival after loyalty offer acceptance compared to similar group not given offer http://www.data-miners.com 26
We Can Use Area to Quantify Results Percent Retained 100% 95% 90% 85% 80% 75% 70% 65% 0 2 Group 1 Group 2 4 6 8 10 93.4% 85.6% 12 14 Months After Intervention 16 18 Increase in survival is given by the area between the curves. For the first year, area of triangle is a good enough estimate Note: there are easy ways to calculate the exact value http://www.data-miners.com 27
Loyalty-Responsive Customers Are Doing Better Than Others Survival for first year after loyalty intervention Group 1: 93.4% Group 2: 85.6% Increase for Group 1: 7.8% Average increase in lifetime for first year is 3.9% (assuming the 7.8% would have stayed, on average, 6 months) Assuming revenue of $400/year, loyalty responsive contribute an additional revenue of $15.60 during the first year This actually compares favorably to cost of loyalty program http://www.data-miners.com 28
Competing Risks: Customers May Leave for Many Reasons Customers may cancel Voluntarily Involuntarily Migration Survival Analysis Can Show Competing Risks overall S(t) is the product of S r (t) for all risks We ll walk through an example Overall survival Competing risks survival Competing risks hazards Stratified competing risks survival http://www.data-miners.com 29
Overall Survival for a Group of Customers 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% Sharp drop during this period, otherwise smooth Overall 0% 0 30 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570 600 http://www.data-miners.com 30
Competing Risks for Voluntary and Involuntary Churn 100% 90% 80% 70% 60% 50% 40% 30% Steep drop is associated with involuntary churn Involuntary Voluntary Overall 20% 10% 0% Survival for each risk is always greater than overall survival 0 30 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570 600 http://www.data-miners.com 31
What the Hazards Look Like 9% 8% 7% As expected, steep drop has a very high hazard Involuntary Voluntary 6% 5% 4% 3% 2% 1% 0% 0 30 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570 600 http://www.data-miners.com 32
Most Involuntary Churn is from Non Credit Card Payers 100% 90% 80% 70% 60% 50% Involuntary-CC Voluntary-CC Involuntary-Other Voluntary-Other Steep drop associated only with non-credit card payers 40% 30% 20% 10% 0% 0 30 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570 600 http://www.data-miners.com 33
Time to Reactivation When a customer stops, often they come back this is winback or reactivation In this case, the initial condition is the stop The final condition is the restart Not all customers restart http://www.data-miners.com 34
Survival Can Be Applied to Other Events: Reactivation 100% 10% 90% 9% Survival (Remain Deactivated) 80% 70% 60% 50% 40% 30% 20% 10% 8% 7% 6% 5% 4% 3% 2% 1% Hazard Probability ("Risk" of Reactivating) 0% 0% 0 30 60 90 120 150 180 210 240 270 300 330 360 Days After Deactivation http://www.data-miners.com 35
Time to Next Order In retailing type businesses, customers make multiple purchases Survival analysis can be applied here, too The question is: how long to the next purchase? Initial state is: date of purchase Final state is: date of next purchase It is better to look at 1 survival rather than survival http://www.data-miners.com 36
Time to Next Purchase, Stratified by Number of Previous Purchases 50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 05-->06 04-->05 03-->04 02-->03 ALL 01-->02 0% 0 30 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 http://www.data-miners.com 37
Customer-Centric Forecasting Using Survival Analysis Forecasting customer numbers is important in many businesses Survival analysis makes it possible to do the forecasts at the finest level of granularity the customer level Survival analysis makes it possible to incorporate customer-specific factors Can be used to estimate restarts as well as stops http://www.data-miners.com 38
Using Survival for Customer Centric Forecasting Number Actual Predicted 130 120 110 100 90 80 70 60 50 40 30 20 10 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 Day of Month http://www.data-miners.com 39
The Forecasting Solution is a Bit Complicated Existing Customer Base New Start Forecast Existing Customer Base Forecast Do Existing Base Forecast (EBF) Do Existing Base Churn Forecast (EBCF) Do New Start Forecast (NSF) Do New Start Churn Forecast (NSCF) New Start Forecast Churn Forecast Churn Actuals Compare http://www.data-miners.com 40
Survival Data Mining Connects Data to Business Needs Provides ways to quantitatively measure what business users know or should know qualitatively Connects data to business practices Techniques such as survival analysis provide new ways of looking at customers Techniques such as customer-centric forecasting integrate data mining with business processes such as forecasting http://www.data-miners.com 41