Predictive Analytics and Machine Learning: An Overview

Similar documents
Human Capital Management:

PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING

SmartCare. SPSS Workshop. Rick Durham - North American Advanced Analytics Channel Team IBM Corporation. Date: 5/28/2014

Analytics for Banks. September 19, 2017

IBM SPSS Predictive Analytics Workshop

Uncover possibilities with predictive analytics

Salford Predictive Modeler. Powerful machine learning software for developing predictive, descriptive, and analytical models.

Machine Learning Based Prescriptive Analytics for Data Center Networks Hariharan Krishnaswamy DELL

Building the In-Demand Skills for Analytics and Data Science Course Outline

Achieve Better Insight and Prediction with Data Mining

Improve Your Marketing Campaign ROI with Predictive Analytics

Copyr i g ht 2012, SAS Ins titut e Inc. All rights res er ve d. ENTERPRISE MINER: ANALYTICAL MODEL DEVELOPMENT

Data Science Training Course

IBM SPSS Modeler Personal

Data analytics in fraud investigations

IBM SPSS Modeler Personal

Using Predictive Analytics to Detect Contract Fraud, Waste, and Abuse Case Study from U.S. Postal Service OIG

Harnessing Predictive Analytics to Improve Customer Data Analysis and Reduce Fraud

PASW Modeler. Achieve Your Goals with Deep, Predictive Insight

Chapter 13 Knowledge Discovery Systems: Systems That Create Knowledge

TDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended.

Enhancing Your "Analytics Quotient" with Enterprise BI & Integrated Predictive Analytics Solutions

Data Mining in CRM THE CRM STRATEGY

Machine Learning 101

Predictive Analytics for Donor Management

IM S5028. Architecture for Analytical CRM. Architecture for Analytical CRM. Customer Analytics. Data Mining for CRM: an overview.

West Midlands Police. Data Driven Insight & Data Science Capability for UK Law Enforcement

Unlocking the Power of Big Data Analytics for Application Security and Security Operation

Professor Dr. Gholamreza Nakhaeizadeh. Professor Dr. Gholamreza Nakhaeizadeh

Oracle Spreadsheet Add-In for Predictive Analytics for Life Sciences Problems

KnowledgeSTUDIO. Advanced Modeling for Better Decisions. Data Preparation, Data Profiling and Exploration

TECH 646 Analysis of Research in Industry and Technology

Chapter 8 Analytical Procedures

Measuring Population Health Management Return on Investment

How to build and deploy machine learning projects

The Insurance Fraud Race Using Information and Analytics to Stay Ahead of Criminals. Copyright 2010, SAS Institute Inc. All rights reserved.

Beyond Rules-Based Tests: The "Seven Wonder" Counter-Fraud Statistical Modeling Tests. Vincent Walden, CFE, CPA, CITP

THE NEXT WAVE WEBSITE AND PERSONALIZATION. Informs Presentation January 13, 2010

MANAGING NEXT GENERATION ARTIFICIAL INTELLIGENCE IN BANKING

Segmentation Successes. February 12, 2014

Louis Bodine IBM STG WW BAO Tiger Team Leader

IBM SPSS Decision Trees

Operational Application of Targeted Data Analysis. Eric Chasin, NTELX Agenda. What we do?

SolidQ Data Science Services Fraud Detection

IBM BW Lead Scoring, Product Recommendation, Retention Solution

Chart your future with predictive analytics

The State of Fraud in Government

DATA DRIVES GOVERNMENT

Business Analytics & Data Mining Modeling Using R Dr. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee

Building A Holistic and Risk-Based Insider Threat Program

IQMEN - BUSINESS INTELLIGENCE

IQMEN - BUSINESS INTELLIGENCE

Predictive Analytics for the Business Analyst. Fern Halper July 8,

GSAW 2018 Machine Learning

Data Warehousing. and Data Mining. Gauravkumarsingh Gaharwar

Democratising Predictive & Embedded Analytics. Clinton Etheridge Senior Pre-Sales Consultant

Achieving customer intimacy with IBM SPSS products

10/26/17. Big Data: An Accountant s New Best Friend. Management Information Systems. Big Data Analysis. Parts I and II November 10, 2017

Reducing Compliance Risk by Automating Supervision

part 1 Marketing Research and the Research Process 1 part 5 Data Analysis and Interpretation 349 2, Determine Research Design 57

IBM Briefing Hva med 2017?

IBM Kenexa Talent Insights

Simple Reward Practices for Better Business Results

Getting Started with Machine Learning

Data Mining Technology and its Future Prospect

Getting Started with Predictive Analytics

DONALD K. WEDDING, P.E., PH.D ANDOVER DR. TWINSBURG, OH (330) SUMMARY

The Analytical Revolution

INSIGHTS PREDICTIVE SCORES

CHAPTER - V JOB SATISFACTION AND OCCUPATIONAL STRESS

Manufacturing Real cases of how manufacturers use analytics to increase their bottom line

GOVERNMENT ANALYTICS LEADERSHIP FORUM SAS Canada & The Institute of Public Administration of Canada. April 26 The Shaw Centre

SUMMARY, FINDINGS, CONCLUSION AND SUGGESTIONS

Let your Data Drive your Future with Predictive Analytics

Optum Performance Analytics

BIOSTATISTICS AND MEDICAL INFORMATICS (B M I)

SAS Business Knowledge Series

Casino Feasibility Study Visitor Segmentation Phase

Optum Performance Analytics

DATA ANALYTICS WITH R, EXCEL & TABLEAU

Using People Analytics to Help Prevent Absences Due to Mental Health Issues

Consumer buying behavior of Durable goods

INTRODUCTION TO INFORMATION SYSTEMS & DECISION MAKING

Deep Dive into High Performance Machine Learning Procedures. Tuba Islam, Analytics CoE, SAS UK

Business Intelligence, Analytics, and Predictive Modeling Kim Gaddy

Innovation in Proactive Mental Health Management. UPMC WorkPartners

Safety Perception / Cultural Surveys

RESEARCH SPOTLIGHT. How PREDICTIVE ANALYTICS Can Improve Your Recruiting Outcomes

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo

Thrive in a changing market with analytics as a core competency

Text Analytics for Executives Title

BIG DATA IN BANKING Pravin Burra, Sept 2018

Gene Expression Data Analysis

Digital Finance in Shared Services & GBS. Deloitte: Piyush Mistry & Oscar Hamilton LBG: Steve McKenna

CSC-272 Exam #1 February 13, 2015

Fraud and the Small Business Owner

IS Today: Managing in a Digital World 9/17/12

Finding Hidden Intelligence with Predictive Analysis of Data Mining

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor: 1.852

University of Toronto / Dumas Research Studies

Transcription:

Predictive Analytics and Machine Learning: An Overview Bill Haffey April 27, 2012 Business Analytics software

Statistical Analysis and Machine Learning Statistical Analysis Confirm Hypotheses Data Requirements More Assumptions Design importance General Population Predictions User Driven Machine Learning Generate Hypotheses Exploratory Less Data Prep Fewer Assumptions Individual Predictions Results Oriented Data Driven

Statistics Use Case Examples Used often in experimental design, clinical trials and survey research with complex sampling designs N.O.R.C. and Gallup use extensive inferential statistics accurately representing survey data on how people think and feel about the world today. NIH uses inferential statistics to analyze clinical data to reveal significant differences in treatments and interventions. CDC extensive epidemiological studies require inferential statistics Used to create data when you don t have it: The data is expensive Sample size What types? Sample, infer 2009 SPSS Inc. 3

In a nutshell Machine Learning works by Clearly defining business goals Data exploration: Discovery of patterns hypothesis generation Weighting of important inputs (With some algorithms) dismissal of non-influential factors Training/Refining/Validation of Models Reliance on domain/data expertise, rather than analytical skills Model deployment Model export in common formats (.xml,.pmml) Automated update

Machine Learning Three classes of algorithms Supervised vs. unsupervised Complementary What events occur together? Given a series of actions; what action is likely to occur next? Associate Patterns Cluster Differences Data Mining Predict who is likely to exhibit specific behavior in the future. Group cases that exhibit similar characteristics. Predict Relationships 2007 SPSS Inc. 5

Supervised Learning: Profile and Predict Build a predictive profile of the historical outcome using a collection of potential input fields. Credit ranking (1=default) Cat. % n Bad 52.01 168 Good 47.99 155 Total (100.00) 323 Paid Weekly/Monthly P-value=0.0000, Chi-square=179.6665, df=1 Weekly pay Monthly salary Cat. % n Cat. % n Bad 86.67 143 Bad 15.82 25 Good 13.33 22 Good 84.18 133 Total (51.08) 165 Total (48.92) 158 Age Categorical Age Categorical P-value=0.0000, Chi-square=30.1113, df=1 P-value=0.0000, Chi-square=58.7255, df=1 Young (< 25);Middle (25-35) Old ( > 35) Young (< 25) Middle (25-35);Old ( > 35) We will Supervise the learning process as the algorithm attempts to model the outcomes using provided inputs Explores all combinations, interactions and contingencies. Use this profile to understand and predict future cases. Cat. % n Bad 90.51 143 Good 9.49 15 Total (48.92) 158 Cat. % n Cat. % n Cat. % n Bad 0.92 1 Bad 0.00 0 Bad 48.98 24 Good 99.08 108 Good 100.00 7 Good 51.02 25 Total (33.75) 109 Total (2.17) 7 Total (15.17) 49 Social Class P-value=0.0016, Chi-square=12.0388, df=1 Management;Clerical Professional Cat. % n Cat. % n Bad 58.54 24 Bad 0.00 0 Good 41.46 17 Good 100.00 8 Total (12.69) 41 Total (2.48) 8 6 2009 SPSS Inc.

Profile and Predict Neural Networks A technique for predicting outcomes based on inputs, where the inputs are weighted on hidden layers Behaves similar to the neurons in your brain Back-propagation to adjust weights based on hits/misses on training data Requires minimal statistical or mathematical knowledge 7 2009 SPSS Inc.

Neural Network Anatomy 8

Neural Network Output 9

Neural Network Summary Excellent for modeling complex relationships and predicting outcomes Can handle nonlinearity and interactions with ease Good for solving many different problem sets (categorical, binary, scale predictors and outcomes) Very poor (Black Box) at describing the relationships among predictors and outcomes 10

Profile and Predict Decision Trees and Rule Induction Classification systems that predict or classify Technique that shows the reasoning contrast with Neural Network Builds sets of easy to understand If Then Rules Eliminates factors that are unimportant 11

Start Here: Claim Amount Decision Tree Output 12

Decision Trees Excellent at uncovering and modeling complex relationships Very accurate on even small data sets to inform decision making. Can handle nonlinear relationships with complex interactions. Very easy to understand and describe to others. Time to insight in minutes. 13

What is Unsupervised Learning? A Machine Learning technique useful when we do not know the output or outputs Can be thought of as finding useful patterns above and beyond noise or fishing for information

Cluster and Associate Clustering An exploratory data analysis technique Reveals natural groups within a data set No prior knowledge about groups or characteristics Large groups interesting, but so are small groups Not always an end in itself Associations Finds things that occur together ex: events in a crime incident Associations can exist between any of the attributes: (no single outcome like Decision Trees) Sequential Associations Discovers association rules in time-oriented data Find the sequence or order of the events 15 2009 SPSS Inc.

What kinds of things can you do with Machine Learning in the Public Sector/Federal? Manage Human Capitol Thwart Insider Threat Detect Fraud Clean up the Streets

IBM SPSS Text Mining Human Capital Management Business Analytics software

Case Study: U.S. Army Reserve - OCAR Challenge Reduce and determine reasons for reserve attrition Reserve soldiers have careers and responsibilities outside of the U.S. Army, making high attrition rates an ongoing challenge. Need to determine the characteristics that lead to attrition and the types and levels of incentives that can aid in retaining a soldier Solution IBM SPSS Modeler SPSS Modeler used to classify soldiers at risk of attrition, including the analysis of military occupational skills (MOS) in classifying attrition SPSS Modeler to create models for incentive planning. Benefits Predicted attrition using demographic data for army reservists. Created a predictive model to analyze why reservists leave and used this model for scoring the possibility for attrition of candidates on a weekly basis. Modeled the soldier incentive types and levels that would minimize cost and attrition

Retention Modeling Process Likelihood of Success If Experience = Info Systems And Education = Undergrad And Years Working <= 5 And Communication Skills > 7 Then Success = Medium(35, 0.78) Current Employees (Education, job history, experience, demographics) Current Data Likelihood to Separate If Education= Post Graduate And Years Working >= 7 And used travel (sentiment NEGATIVE) And Commute >= 30mins Then Leave = YES (94, 0.927) Identify characteristics of employee success and attrition / (dis)satisfaction Payroll (Comp plans, salary) Retention Incentives 1. Salary Increase, prob 0.23 2. Not applicable 3. Flexible Schedule, prob 0.87 4. PerformanceAward, prob 0.36 5. Benefits, prob 0.54 Managers reports on employee satisfaction and performance Survey Data (Attitudes, non work related factors) Data Collection SPSS 2009

Retention Modeling Process Likelihood of Success If Experience = Info Systems And Education = Undergrad And Years Working <= 5 And Communication Skills > 7 Then Success = Medium(35, 0.78) Current Employees (Education, job history, experience, demographics) Likelihood to Separate If Education= Post Graduate And Years Working >= 7 And used travel (sentiment NEGATIVE) And Commute >= 30mins Then Leave = YES (94, 0.927) Predictive Modeling Retention Incentives 1. Salary Increase, prob 0.23 2. Not applicable 3. Flexible Schedule, prob 0.87 4. PerformanceAward, prob 0.36 5. Benefits, prob 0.54 Identify characteristics of employee success and attrition / (dis)satisfaction Managers reports on employee satisfaction and performance Payroll (Comp plans, salary) Survey Data (Attitudes, non work related factors) SPSS 2009

Retention Modeling Process Likelihood of Success If Experience = Info Systems And Education = Undergrad And Years Working <= 5 And Communication Skills > 7 Then Success = Medium(35, 0.78) Current Employees (Education, job history, experience, demographics) Likelihood to Separate If Education= Post Graduate And Years Working >= 7 And used travel (sentiment NEGATIVE) And Commute >= 30mins Then Leave = YES (94, 0.927) Identify characteristics of employee success and attrition / (dis)satisfaction Payroll (Comp plans, salary) Decision Optimization Retention Incentives 1. Salary Increase, prob 0.23 2. Not applicable 3. Flexible Schedule, prob 0.87 4. PerformanceAward, prob 0.36 5. Benefits, prob 0.54 Managers reports on employee satisfaction and performance Survey Data (Attitudes, non work related factors) SPSS 2009

Medicaid Fraud: Trust Solutions CMS Case Study Business Analytics software

Case Study Anita Nurse Apriori results showed Anita as being one of the patients billed for services from various HHAs Therapy at Home HHA Friendly Therapy & Fun HHA Comfortable Quarters HHA Rehabilitation & Recreation HHA The output does not show the frequency by which Anita moved back-and-forth

Case Study Anita Nurse Therapy At Home Friendly Therapy & Fun Further data analysis revealed that Anita actually moved among these 4 providers in an 18 month period! Given the nature of HHA, TS clinicians and investigators felt that this was unusual Shows Anita moved among these providers but not the order Comfortable Quarters Rehabilitation & Recreation

Sequencing Algorithm Specific type of association technique where it is not only important that a relationship exists, but the order of events is of interest Sequencing rules are used to answer questions like: Does the pattern of a purchase predict the future purchase of another item? Do people buy health club memberships after visiting a doctor? Does a patient visit provider B after visiting provider A?

Case Study Ben Feelinsick Apriori results had shown that Ben had been billed for services by 2 different providers: Healing Nurses HHA Therapy at Home HHA Initially, this might not be of interest, or not as much as Anita Examining the data further using sequencing, a suspicious pattern emerges!

Case Study Ben Feelinsick Therapy at Home 3/13/2000 Dx 340 (MS) 2/14/2000 Dx 340 (MS) 2/28/2000 Dx 5990 (Ur Trct Inf) 2/19/2000 Dx 340 (MS) 3/12/2000 3/19/2000 Dx 5990 (Ur Trct Inf) Dx 5990 (Ur Trct Inf) In only a 4 month span, Ben moved between these 2 providers a total of 7 times How does this pattern benefit the patient? It is suspicious that the patient moves so much during this short time 3/27/2000 Dx 340 (MS) 4/21/2000 Dx 5990 (Ur Trct Inf) Healing Nurses

January 12, 2012 Law Enforcement Business Analytics software

Law Enforcement Problem: Spiraling crime rates, limited officer resources -- better deployment decisions required Solve: (In addition to incident data) weather, city events, holiday/payday cycles, etc better picture of criminal incidents, more accurate prediction, more effective deployment

Model mechanics law enforcement scenario Night Day NPD PD/PD+1 Clear Rain N3D 3D 30

31

Insider Threat Detection and Analysis Business Analytics software

What is an insider threat? A current or former employee, contractor, or business partner who: has or had authorized access to an organization s network, system, or data and intentionally exceeded or misused that access in a manner that negatively affected the confidentiality, integrity, or availability of the organization s information or information systems Source: U.S. CERT 33

Insider Threat Analysis Use Case Common Data Environment: Audit data network and server logs, files accessed, emails and content, employee demographics Large volumes Disparate sources Different data formats - structured and unstructured Using Machine Learning: Merge and exploit data from all sources using all relevant data attributes Model normality to identify anomalous behavior Trend/ Predict which employee is not behaving like peers 34

What is Normal? Baseline Activity Including resource usage, work hours, document type Used to baseline activity of employees against: Their own past history The past history of their peers (job title, department, project) Change in Cluster Membership Used for both Reactive and Proactive Analysis Spikes in Activity Reversals in Trends

Reactive Analysis A K-Nearest Neighbor algorithm is used to easily identify employees whose behavior closely matches that of the person being audited. other Segmentation algorithms and Association algorithms are also used to group people based on behavior patterns 36

Proactive Analysis Analysis of documents accessed by employees and how closely each person is associated to certain topics of interest Most of the work done within proactive analysis is used to contribute to an individual s risk score or to create a model to classify the likely risk for that individual. 37

Thanks? 38