PROVEN PRACTICES FOR PREDICTIVE MODELING
|
|
- Isabella Sparks
- 5 years ago
- Views:
Transcription
1 PROVEN PRACTICES FOR PREDICTIVE MODELING BROUGHT TO YOU BY SAS CUSTOMER LOYALTY CONTRIBUTIONS FROM: DARIUS BAER DAVID OGDEN DOUG WIELENGA MARY-ELIZABETH ( M-E ) EDDLESTONE PRINCIPAL SYSTEMS ENGINEER, ANALYTICS
2 BEST PRACTICES DISCLAIMERS The choice of Best Practices is highly subjective. Certain suggested practices may not be suitable for a particular situation. It is the responsibility of a data mining practitioner to critically evaluate methods and select the best method for a particular situation. This presentation represents the opinions of the contributors.
3 BEST PRACTICES TO HELP YOU MEET AND EXCEED YOUR GOALS Faster model development More useful models Superior models
4 BEST PRACTICES FORMAT OF PRESENTATION Background & General Guidance Developing the Data Developing & Delivering the Model
5 BACKGROUND &GENERAL GUIDANCE ANALYTICS CYCLE AND THE MODELING PROCESS
6 ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models Develop Models Transform & Select
7 BEST PRACTICE IT S ALL ABOUT BALANCE
8 BEST PRACTICE IT S ALL ABOUT BALANCE Many factors need to be considered and optimized: Time People Money IT Resources Business Process People Technology
9 ANALYTICS INFRASTRUCTURE Fact-based decision making requires the right technology, talent, processes and culture Continuous Process Improvement Planning Project methodology Standards BUSINESS PROCESS TECHNOLOGY DECISION ANALYTICS PEOPLE BI reporting Web portals / dashboards Information management Problem-specific business solutions Predictive analytics Hardware Vision & Leadership Team composition Enterprise authority
10 LIFECYCLE BEST PRACTICE INVOLVE ALL THE RELEVANT PEOPLE/ROLES BUSINESS MANAGER Domain Expert Makes Decisions Evaluates Processes & ROI Deploy Model Evaluate & Monitor Model Formulate Problem Data Preparation Data Exploration DATA MINER Exploratory Analysis Descriptive Segmentation Predictive Modeling Model Validation & Registration BUSINESS ANALYST Data Exploration Data Visualization Report Creation Validate Models Develop Models Transform & Select IT/SYSTEMS MANAGEMENT Model Validation Model Deployment Model Monitoring Data Preparation
11 BEST PRACTICE WEAR MANY HATS Have a passion to understand not just analytics, but the business and technology.
12 BEST PRACTICE USE THE TECHNOLOGY AND METHOD THE FITS THE JOB Every tool and method has advantages and disadvantages. Whenever possible, select the tool or method that balances long-term goals for the entire process.
13 BEST PRACTICE BEGIN WITH THE END IN MIND
14 BEST PRACTICE BEGIN WITH THE END IN MIND What? How? Who? When?
15 BEST PRACTICE BEGIN WITH THE END IN MIND What is the overarching strategic objective/initiative? How will the model be used? How will it be put into production?
16 BEST PRACTICE BEGIN WITH THE END IN MIND Who will be affected by the use of the model? Who needs to be convinced of the value of the model? When will the model be used?
17 BEST PRACTICES BUSINESS CONSIDERATIONS BEFORE YOU MODEL Thoroughly understand the business/marketing objectives Detail the precise (planned) usage for the output Define the target variable (the outcome being modeled / predicted) Formulate a theoretical model: Y = f (X 1, X 2, ) fill-in the likely X s
18 BEST PRACTICE SEMMA Process for Model Development
19 BEST PRACTICE MODELING APPROACH 1. Sample training set(s), validation set(s), holdout test set 2. Explore min, max, mean, median, missing values, levels (categorical cardinality) 3. Modify filtering outliers, reducing cardinality, correcting multicolinearity, imputations, nonlinear transformations
20 BEST PRACTICE MODELING APPROACH 4. Model variable selection, various model formulations, iterative cycle, insights & client reviews 5. Assess performance criteria and review
21 BEST PRACTICE MODELING APPROACH (CONTINUED) 7. Final Assessment & Testing 8. Profile characteristics & indicators 9. Document results 10. Prepare (production-ready) data collection and score code 11. Monitor model performance
22 DEVELOPING THE DATA
23 BEST PRACTICES OPTIMIZING DATA Determining Data Selecting Target Preparing Variables
24 DETERMINING DATA
25 BEST PRACTICES TECHNICAL CONSIDERATIONS BEFORE MODELING Brainstorm all potential input data elements Identify source systems, specific data fields, availability/priority/level-of-effort of data Finalize data to be collected
26 BEST PRACTICES TECHNICAL CONSIDERATIONS BEFORE MODELING Formulate structure and layout of modeling dataset to be built Devil-in-the-details: filters, timeframe of history, etc Build modeling dataset
27 BEST PRACTICE ALLOW SUFFICIENT TIME FOR ALL ASPECTS
28 BEST PRACTICE SEMMA Process for Model Development
29 SAMPLE (Over) Sampling Decisioning Partitioning
30 SAMPLING
31 SAMPLE TO SAMPLE OR NOT? Sampling is a valuable tool that can be used to great effect. If computing resources are no object, it s possible to use all data. When resource constrained, try increasing sample sizes as model development progresses. When model is nearly finalized, try different SAMPLE SAMPLE ALL DATA seeds for samples to ensure model stability.
32 SAMPLE WHAT ABOUT OVERSAMPLING? It depends. Frequently one needs to oversample in order to allow algorithm(s) to identify effect, especially with rare targets. Only oversample as much as you need to in order to obtain a model that makes sense from a business perspective. This is highly subjective.
33 SELECTING TARGET
34 CHOOSING YOUR TARGET Choosing the Target Response vs. Propensity Number of Models
35 DECISIONING
36 WEIGHTING YOUR DECISIONS Expected Profit Decision Boundaries
37 UNDERSTANDING EXPECTED PROFIT Consider this game Flip a fair coin one time If it is heads, you win $10.00 Cost of playing one time is $1.00 Do you want to play this game?
38 UNDERSTANDING EXPECTED PROFIT Consider this game Flip a fair coin one time If it is heads, you win $10.00 Cost of playing one time is $1.00 E(Profit) = 0.5 * (10-1) * (-1) = (-0.50) = 4.00
39 DECISIONS COMBINING THE WEIGHTS WITH PRIORS
40 DECISIONS DETERMINING DECISION WEIGHTS To determine the amount of weight to assign to the rare event in a binary target, calculate this ratio: Specify the weight of the rare event to be this ratio
41 DECISIONS DETERMINING DECISION WEIGHTS: EXAMPLE Consider a binary event where Prob(Yes) = 0.1 and Prob(No) = 0.9 To determine the amount of weight to assign to the rare event in a binary target, calculate this ratio: =. = 9. Specify the weight of the rare event to be this ratio
42 DECISIONS INCORPORATING PRIORS Before fitting model Decision Profile After fitting model Decision Node
43 PARTITIONING
44 SAMPLE DATA PARTITIONING PARTITION ROLE Training Used to fit the model Validation Used to validate the model and prevent over-fitting Test Used to provide unbiased estimate of model performance
45 SAMPLE SAMPLE: DATA PARTITIONING WHAT IS OPTIMAL PARTITION? 30% 30% 40% Training Validation Test
46 BEST PRACTICE SAMPLE: DATA PARTITIONING 40% WHAT IS OPTIMAL PARTITION? 0% 60% Training Validation Test
47 BEST PRACTICE SAMPLE: DATA PARTITIONING WHAT IS OPTIMAL PARTITION? 30% 0% Training Validation It depends! Test 70%
48 SAMPLE DATA PARTITIONING CONSIDERATIONS How much data is available? Is an unbiased measure of model performance required? Should test data be in-sample or out-of-sample? How many test samples are needed? (e.g. different time periods, different geographies, etc.) When should test data be used in the process?
49 BEST PRACTICE DATA PARTITIONING Percentages: frequently used percentages are 50/50/0, 60/40/0 and 70/30/0 with a completely separate Test partition. Do not bring Test data into process until model is complete. It should not influence modeling process, merely used to report performance. Multiple Test data can be used consider how model will be deployed and create representative samples.
50 PREPARING DATA
51 EXPLORE & MODIFY ITERATIVE RELATIONSHIP WITH DATA PREPARATION Data Prep Data Modification Data Exploration
52 BEST PRACTICE EXPLORE & MODIFY: GETTING THE MOST OUT OF DATA Once you have an analytics-ready table: Examine Categorical Variables Examine Continuous Variables Explore Missing Values Cluster Variables
53 EXPLORE & MODIFY CATEGORICAL VARIABLES
54 EXPLORE & MODIFY CONTINUOUS VARIABLES
55 EXPLORE & MODIFY MISSING DATA Why is data missing? Are there patterns to the missing data within or across variables? Imputation methods to consider
56 EXPLORE & MODIFY CLUSTER VARIABLES There is no single answer for clusters Design clusters and profiles around themes using smaller set of related variables
57 SELECTING VARIABLES
58 EXPLORE & MODIFY VARIABLE SELECTION/REDUCTION TECHNIQUES Stepwise Regression Variable Selection Node Decision Tree Node Variable Clustering Combined Approach
59 EXPLORE & MODIFY VARIABLE SELECTION/REDUCTION TECHNIQUES Multicollinearity
60 EXPLORE & MODIFY VARIABLE SELECTION/REDUCTION TECHNIQUES Interactions
61 BEST PRACTICES OPTIMIZING DATA Selecting Target Determining Data Preparing Variables
62 DEVELOPING & DELIVERING THE MODEL
63 MODEL & ASSESS DELIVERING THE MODEL Developing Your Model Choosing a Model Deploying the Model
64 DEVELOPING THE MODEL
65 MODEL MODEL DEVELOPMENT Regression Decision Trees Neural Networks Ensemble Rule Induction Something Else?
66 BEST PRACTICE MODEL DEVELOPMENT Try various techniques and combinations of techniques.
67 CHOOSING A MODEL
68 BEST PRACTICES MODEL SELECTION Evaluate model metrics Consider business knowledge Recognize constraints
69 ASSESS CUMULATIVE CHARTS
70 ASSESS NON-CUMULATIVE CHARTS
71 DEPLOYING THE MODEL
72 BEST PRACTICES MODEL DEPLOYMENT Reporting Results Clean up and back up Monitor performance
73 BEST PRACTICES MODEL DEPLOYMENT Incorporate and share knowledge Automate ETL (Extract, Transform, Load) Automate process
74 ULTIMATE GOAL SOURCE / OPERATIONAL SYSTEMS SAS MODEL FACTORY DATA PREPARATION MODEL DEVELOPMENT MODEL DEPLOYMENT MODEL MANAGEMENT
75 MODEL & ASSESS DELIVERING THE MODEL Developing Your Model Choosing a Model Deploying the Model
76 BEST PRACTICES FORMAT OF PRESENTATION Background & General Guidance Developing the Data Developing & Delivering the Model
77 BEST PRACTICE BE ANALYTICALLY SAVVY AND CREATIVE analytical creative It s both science and art!
78 THANK YOU FOR USING SAS! Copyright 2013 SAS Institute Inc. All rights reserved.