PROVEN PRACTICES FOR PREDICTIVE MODELING

Size: px
Start display at page:

Download "PROVEN PRACTICES FOR PREDICTIVE MODELING"

Transcription

1 PROVEN PRACTICES FOR PREDICTIVE MODELING BROUGHT TO YOU BY SAS CUSTOMER LOYALTY CONTRIBUTIONS FROM: DARIUS BAER DAVID OGDEN DOUG WIELENGA MARY-ELIZABETH ( M-E ) EDDLESTONE PRINCIPAL SYSTEMS ENGINEER, ANALYTICS

2 BEST PRACTICES DISCLAIMERS The choice of Best Practices is highly subjective. Certain suggested practices may not be suitable for a particular situation. It is the responsibility of a data mining practitioner to critically evaluate methods and select the best method for a particular situation. This presentation represents the opinions of the contributors.

3 BEST PRACTICES TO HELP YOU MEET AND EXCEED YOUR GOALS Faster model development More useful models Superior models

4 BEST PRACTICES FORMAT OF PRESENTATION Background & General Guidance Developing the Data Developing & Delivering the Model

5 BACKGROUND &GENERAL GUIDANCE ANALYTICS CYCLE AND THE MODELING PROCESS

6 ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models Develop Models Transform & Select

7 BEST PRACTICE IT S ALL ABOUT BALANCE

8 BEST PRACTICE IT S ALL ABOUT BALANCE Many factors need to be considered and optimized: Time People Money IT Resources Business Process People Technology

9 ANALYTICS INFRASTRUCTURE Fact-based decision making requires the right technology, talent, processes and culture Continuous Process Improvement Planning Project methodology Standards BUSINESS PROCESS TECHNOLOGY DECISION ANALYTICS PEOPLE BI reporting Web portals / dashboards Information management Problem-specific business solutions Predictive analytics Hardware Vision & Leadership Team composition Enterprise authority

10 LIFECYCLE BEST PRACTICE INVOLVE ALL THE RELEVANT PEOPLE/ROLES BUSINESS MANAGER Domain Expert Makes Decisions Evaluates Processes & ROI Deploy Model Evaluate & Monitor Model Formulate Problem Data Preparation Data Exploration DATA MINER Exploratory Analysis Descriptive Segmentation Predictive Modeling Model Validation & Registration BUSINESS ANALYST Data Exploration Data Visualization Report Creation Validate Models Develop Models Transform & Select IT/SYSTEMS MANAGEMENT Model Validation Model Deployment Model Monitoring Data Preparation

11 BEST PRACTICE WEAR MANY HATS Have a passion to understand not just analytics, but the business and technology.

12 BEST PRACTICE USE THE TECHNOLOGY AND METHOD THE FITS THE JOB Every tool and method has advantages and disadvantages. Whenever possible, select the tool or method that balances long-term goals for the entire process.

13 BEST PRACTICE BEGIN WITH THE END IN MIND

14 BEST PRACTICE BEGIN WITH THE END IN MIND What? How? Who? When?

15 BEST PRACTICE BEGIN WITH THE END IN MIND What is the overarching strategic objective/initiative? How will the model be used? How will it be put into production?

16 BEST PRACTICE BEGIN WITH THE END IN MIND Who will be affected by the use of the model? Who needs to be convinced of the value of the model? When will the model be used?

17 BEST PRACTICES BUSINESS CONSIDERATIONS BEFORE YOU MODEL Thoroughly understand the business/marketing objectives Detail the precise (planned) usage for the output Define the target variable (the outcome being modeled / predicted) Formulate a theoretical model: Y = f (X 1, X 2, ) fill-in the likely X s

18 BEST PRACTICE SEMMA Process for Model Development

19 BEST PRACTICE MODELING APPROACH 1. Sample training set(s), validation set(s), holdout test set 2. Explore min, max, mean, median, missing values, levels (categorical cardinality) 3. Modify filtering outliers, reducing cardinality, correcting multicolinearity, imputations, nonlinear transformations

20 BEST PRACTICE MODELING APPROACH 4. Model variable selection, various model formulations, iterative cycle, insights & client reviews 5. Assess performance criteria and review

21 BEST PRACTICE MODELING APPROACH (CONTINUED) 7. Final Assessment & Testing 8. Profile characteristics & indicators 9. Document results 10. Prepare (production-ready) data collection and score code 11. Monitor model performance

22 DEVELOPING THE DATA

23 BEST PRACTICES OPTIMIZING DATA Determining Data Selecting Target Preparing Variables

24 DETERMINING DATA

25 BEST PRACTICES TECHNICAL CONSIDERATIONS BEFORE MODELING Brainstorm all potential input data elements Identify source systems, specific data fields, availability/priority/level-of-effort of data Finalize data to be collected

26 BEST PRACTICES TECHNICAL CONSIDERATIONS BEFORE MODELING Formulate structure and layout of modeling dataset to be built Devil-in-the-details: filters, timeframe of history, etc Build modeling dataset

27 BEST PRACTICE ALLOW SUFFICIENT TIME FOR ALL ASPECTS

28 BEST PRACTICE SEMMA Process for Model Development

29 SAMPLE (Over) Sampling Decisioning Partitioning

30 SAMPLING

31 SAMPLE TO SAMPLE OR NOT? Sampling is a valuable tool that can be used to great effect. If computing resources are no object, it s possible to use all data. When resource constrained, try increasing sample sizes as model development progresses. When model is nearly finalized, try different SAMPLE SAMPLE ALL DATA seeds for samples to ensure model stability.

32 SAMPLE WHAT ABOUT OVERSAMPLING? It depends. Frequently one needs to oversample in order to allow algorithm(s) to identify effect, especially with rare targets. Only oversample as much as you need to in order to obtain a model that makes sense from a business perspective. This is highly subjective.

33 SELECTING TARGET

34 CHOOSING YOUR TARGET Choosing the Target Response vs. Propensity Number of Models

35 DECISIONING

36 WEIGHTING YOUR DECISIONS Expected Profit Decision Boundaries

37 UNDERSTANDING EXPECTED PROFIT Consider this game Flip a fair coin one time If it is heads, you win $10.00 Cost of playing one time is $1.00 Do you want to play this game?

38 UNDERSTANDING EXPECTED PROFIT Consider this game Flip a fair coin one time If it is heads, you win $10.00 Cost of playing one time is $1.00 E(Profit) = 0.5 * (10-1) * (-1) = (-0.50) = 4.00

39 DECISIONS COMBINING THE WEIGHTS WITH PRIORS

40 DECISIONS DETERMINING DECISION WEIGHTS To determine the amount of weight to assign to the rare event in a binary target, calculate this ratio: Specify the weight of the rare event to be this ratio

41 DECISIONS DETERMINING DECISION WEIGHTS: EXAMPLE Consider a binary event where Prob(Yes) = 0.1 and Prob(No) = 0.9 To determine the amount of weight to assign to the rare event in a binary target, calculate this ratio: =. = 9. Specify the weight of the rare event to be this ratio

42 DECISIONS INCORPORATING PRIORS Before fitting model Decision Profile After fitting model Decision Node

43 PARTITIONING

44 SAMPLE DATA PARTITIONING PARTITION ROLE Training Used to fit the model Validation Used to validate the model and prevent over-fitting Test Used to provide unbiased estimate of model performance

45 SAMPLE SAMPLE: DATA PARTITIONING WHAT IS OPTIMAL PARTITION? 30% 30% 40% Training Validation Test

46 BEST PRACTICE SAMPLE: DATA PARTITIONING 40% WHAT IS OPTIMAL PARTITION? 0% 60% Training Validation Test

47 BEST PRACTICE SAMPLE: DATA PARTITIONING WHAT IS OPTIMAL PARTITION? 30% 0% Training Validation It depends! Test 70%

48 SAMPLE DATA PARTITIONING CONSIDERATIONS How much data is available? Is an unbiased measure of model performance required? Should test data be in-sample or out-of-sample? How many test samples are needed? (e.g. different time periods, different geographies, etc.) When should test data be used in the process?

49 BEST PRACTICE DATA PARTITIONING Percentages: frequently used percentages are 50/50/0, 60/40/0 and 70/30/0 with a completely separate Test partition. Do not bring Test data into process until model is complete. It should not influence modeling process, merely used to report performance. Multiple Test data can be used consider how model will be deployed and create representative samples.

50 PREPARING DATA

51 EXPLORE & MODIFY ITERATIVE RELATIONSHIP WITH DATA PREPARATION Data Prep Data Modification Data Exploration

52 BEST PRACTICE EXPLORE & MODIFY: GETTING THE MOST OUT OF DATA Once you have an analytics-ready table: Examine Categorical Variables Examine Continuous Variables Explore Missing Values Cluster Variables

53 EXPLORE & MODIFY CATEGORICAL VARIABLES

54 EXPLORE & MODIFY CONTINUOUS VARIABLES

55 EXPLORE & MODIFY MISSING DATA Why is data missing? Are there patterns to the missing data within or across variables? Imputation methods to consider

56 EXPLORE & MODIFY CLUSTER VARIABLES There is no single answer for clusters Design clusters and profiles around themes using smaller set of related variables

57 SELECTING VARIABLES

58 EXPLORE & MODIFY VARIABLE SELECTION/REDUCTION TECHNIQUES Stepwise Regression Variable Selection Node Decision Tree Node Variable Clustering Combined Approach

59 EXPLORE & MODIFY VARIABLE SELECTION/REDUCTION TECHNIQUES Multicollinearity

60 EXPLORE & MODIFY VARIABLE SELECTION/REDUCTION TECHNIQUES Interactions

61 BEST PRACTICES OPTIMIZING DATA Selecting Target Determining Data Preparing Variables

62 DEVELOPING & DELIVERING THE MODEL

63 MODEL & ASSESS DELIVERING THE MODEL Developing Your Model Choosing a Model Deploying the Model

64 DEVELOPING THE MODEL

65 MODEL MODEL DEVELOPMENT Regression Decision Trees Neural Networks Ensemble Rule Induction Something Else?

66 BEST PRACTICE MODEL DEVELOPMENT Try various techniques and combinations of techniques.

67 CHOOSING A MODEL

68 BEST PRACTICES MODEL SELECTION Evaluate model metrics Consider business knowledge Recognize constraints

69 ASSESS CUMULATIVE CHARTS

70 ASSESS NON-CUMULATIVE CHARTS

71 DEPLOYING THE MODEL

72 BEST PRACTICES MODEL DEPLOYMENT Reporting Results Clean up and back up Monitor performance

73 BEST PRACTICES MODEL DEPLOYMENT Incorporate and share knowledge Automate ETL (Extract, Transform, Load) Automate process

74 ULTIMATE GOAL SOURCE / OPERATIONAL SYSTEMS SAS MODEL FACTORY DATA PREPARATION MODEL DEVELOPMENT MODEL DEPLOYMENT MODEL MANAGEMENT

75 MODEL & ASSESS DELIVERING THE MODEL Developing Your Model Choosing a Model Deploying the Model

76 BEST PRACTICES FORMAT OF PRESENTATION Background & General Guidance Developing the Data Developing & Delivering the Model

77 BEST PRACTICE BE ANALYTICALLY SAVVY AND CREATIVE analytical creative It s both science and art!

78 THANK YOU FOR USING SAS! Copyright 2013 SAS Institute Inc. All rights reserved.