MODELING THE EXPERT. An Introduction to Logistic Regression The Analytics Edge

Similar documents
Logistic Regression and Decision Trees

SUPPLEMENTARY INFORMATION

Predicting Corporate Influence Cascades In Health Care Communities

3 Ways to Improve Your Targeted Marketing with Analytics

Credibility: Evaluating What s Been Learned

Weka Evaluation: Assessing the performance

RESULT AND DISCUSSION

Verification of Method XXXXXX

Evaluation next steps Lift and Costs

Predictive Modeling using SAS. Principles and Best Practices CAROLYN OLSEN & DANIEL FUHRMANN

Feature Selection in Pharmacogenetics

Purchase Predicting with Clickstream Data

2 Maria Carolina Monard and Gustavo E. A. P. A. Batista

Keywords acrm, RFM, Clustering, Classification, SMOTE, Metastacking

Getting Started with Predictive Analytics

Applying an Interactive Machine Learning Approach to Statutory Analysis

Big Data. Methodological issues in using Big Data for Official Statistics

Overview of Statistics used in QbD Throughout the Product Lifecycle

Machine Learning Techniques For Particle Identification

Diagnostics. Using Diagnostic Tests. Test Sensitivity and Specificity. Dr. Randall Singer Professor of Epidemiology

Clinical Data Architecture for Business Intelligence and Quality Reporting

Predictive Accuracy: A Misleading Performance Measure for Highly Imbalanced Data

DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues

Ontario s Narcotics Strategy

Determining NDMA Formation During Disinfection Using Treatment Parameters Introduction Water disinfection was one of the biggest turning points for

Chapter 5 Evaluating Classification & Predictive Performance

Quadratic Regressions Group Acitivity 2 Business Project Week #4

GETTING STARTED WITH PROC LOGISTIC

Assay Validation Services

Advanced analytics at your hands

A Comparative Study of Filter-based Feature Ranking Techniques

Estimating Discrete Choice Models of Demand. Data

OBSERVATIONAL MEDICAL OUTCOMES PARTNERSHIP

Model Selection, Evaluation, Diagnosis

Credit Card Marketing Classification Trees

Salford Predictive Modeler. Powerful machine learning software for developing predictive, descriptive, and analytical models.

Industrial Organization Field Exam. May 2013

Harbingers of Failure: Online Appendix

Cost-effectiveness and cost-utility analysis accompanying Cancer Clinical trials. NCIC CTG New Investigators Workshop

New Customer Acquisition Strategy

Chapter 10 Regression Analysis

GETTING STARTED WITH PROC LOGISTIC

Startup Machine Learning: Bootstrapping a fraud detection system. Michael Manapat

An empirical machine learning method for predicting potential fire control locations for pre-fire planning and operational fire management

Session 7. Introduction to important statistical techniques for competitiveness analysis example and interpretations

Achieving World Class Safety Performance Through Metrics

Propensity of Contract Renewals

Logistic Regression for Early Warning of Economic Failure of Construction Equipment

Linear model to forecast sales from past data of Rossmann drug Store

Analysing Players in Online Games

MEANINGFUL USE CRITERIA PHYSICIANS

Phase Appropriate Validation Design for Potency Assays

1201 Maryland Avenue SW, Suite 900, Washington, DC ,

Chapter 2 Tools of Positive Analysis

An Introduction to Data Analytics For Software Security

Introduction to Machine Learning for Longitudinal Medical Data

Quality Control Assessment in Genotyping Console

Outline. Quality of Data Linkage Megan Bohensky August Background Quality in Data Linkage Study Example Reporting Standards.

[Type the document title]

Frost & Sullivan independent analysis indicates that the market is slowly being eroded by several factors listed below.

Timing Production Runs

H.I. Farag 1, A.H. Hashish 2, M.M. Mourad 3, M.M.M. El-Nasharty 1

Advanced Tutorials. SESUG '95 Proceedings GETTING STARTED WITH PROC LOGISTIC

Breakthrough Denials Performance: Leveraging Analytics and Optimizing Upfront Success

YASHAJIT SAHA & ABHISHEK SHARMA, SUBJECT MATTER EXPERTS, RESEARCH & ANALYTICS ADVANCED ANALYTICS: A REMEDY FOR COMMERCIAL SUCCESS IN PHARMA.

IMPROVING PREDICTIVE VALUE OF INDICATORS OF POOR PERFORMANCE Patricia Fazio Bronson and David Sparrow

Assessing the Accuracy of Legal Implementation Readiness Decisions

An Assessment of Deforestation Models for Reducing Emissions from Deforestation & forest Degradation (REDD)

APIXIO HCC PROFILER Reimagining Risk Adjustment

Equipment and preparation required for one group (2-4 students) to complete the workshop

Chapter 6: Customer Analytics Part II

HEALTHCARE SOLUTIONS UNIPICK 2 AUTOMATED PACK DISPENSING SYSTEM

Analytical Modeling: A Technique for Handling Big (and small) Data

An Evidence-Based Approach to Hiring

Re: Docket No. FDA 2005-D-0339: Draft Guidance on Drug Safety Information FDA's Communication to the Public

Learning from the Observational Medical Outcomes Partnership (OMOP)

Application of Machine Learning to Financial Trading

Prediction of Road Accidents Severity using various algorithms

Decision Analytic Thinking. Pekka Malo, Assist. Prof. (statistics) Aalto BIZ / Department of Information and Service Economy

Performance Measures & MetricsBased Management

Accurate Campaign Targeting Using Classification Algorithms

Overview. Presenter: Bill Cheney. Audience: Clinical Laboratory Professionals. Field Guide To Statistics for Blood Bankers

A spec i a l r e p o r t w ri t t e n i n a s s o c i at i o n w i t h I M S

Multivariate Logistic Regression Prediction of Fault-Proneness in Software Modules

Binary logistic regression modelling in predicting channel behaviour towards instant coffee vending machines

Drive Better Insights with Oracle Analytics Cloud

CREDIT RISK MODELLING Using SAS

A News-Based Portfolio Management System

Why is Healthcare Stuck in Reporting?

CHAPTER 8 APPLICATION OF CLUSTERING TO CUSTOMER RELATIONSHIP MANAGEMENT

Feature Selection for Predictive Modelling - a Needle in a Haystack Problem

Population Health Provider Boosts Quality and Customer Satisfaction for Member Health Solutions

Consumer Choice and Demand. Chapter 9

Using Your Prescription. Drug Program. Answers to important questions about retail pharmacy and mail order purchasing

Five Drugs With Price Increases During 2016

Patient persistency to prescribed medications,

Medicare Data Solutions RFP FAQ

Logistic Regression with Expert Intervention

USE OF PRIMARY CARE DATABASES IN EPIDEMIOLOGIC RESEARCH

Homework 4. Due in class, Wednesday, November 10, 2004

Transcription:

MODELING THE EXPERT An Introduction to Logistic Regression 15.071 The Analytics Edge

Ask the Experts! Critical decisions are often made by people with expert knowledge Healthcare Quality Assessment Good quality care educates patients and controls costs Need to assess quality for proper medical interventions No single set of guidelines for defining quality of healthcare Health professionals are experts in quality of care assessment 15.071x Modeling the Expert: An Introduction to Logistic Regression 1

Experts are Human Experts are limited by memory and time Healthcare Quality Assessment Expert physicians can evaluate quality by examining a patient s records This process is time consuming and inefficient Physicians cannot assess quality for millions of patients 15.071x Modeling the Expert: An Introduction to Logistic Regression 2

Replicating Expert Assessment Can we develop analytical tools that replicate expert assessment on a large scale? Learn from expert human judgment Develop a model, interpret results, and adjust the model Make predictions/evaluations on a large scale Healthcare Quality Assessment Let s identify poor healthcare quality using analytics 15.071x Modeling the Expert: An Introduction to Logistic Regression 3

Claims Data Medical Claims Diagnosis, Procedures, Doctor/Hospital, Cost Pharmacy Claims Drug, Quantity, Doctor, Medication Cost Electronically available Standardized Not 100% accurate Under-reporting is common Claims for hospital visits can be vague 15.071x Modeling the Expert: An Introduction to Logistic Regression 1

Creating the Dataset Claims Samples Claims Sample Large health insurance claims database Randomly selected 131 diabetes patients Ages range from 35 to 55 Costs $10,000 $20,000 September 1, 2003 August 31, 2005 15.071x Modeling the Expert: An Introduction to Logistic Regression 2

Creating the Dataset Expert Review Claims Sample Expert Review Expert physician reviewed claims and wrote descriptive notes: Ongoing use of narcotics Only on Avandia, not a good first choice drug Had regular visits, mammogram, and immunizations Was given home testing supplies 15.071x Modeling the Expert: An Introduction to Logistic Regression 3

Creating the Dataset Expert Assessment Claims Sample Rated quality on a two-point scale (poor/good) Expert Review Expert Assessment I d say care was poor poorly treated diabetes No eye care, but overall I d say high quality 15.071x Modeling the Expert: An Introduction to Logistic Regression 4

Creating the Dataset Variable Extraction Claims Sample Expert Review Expert Assessment Dependent Variable Quality of care Independent Variables ongoing use of narcotics only on Avandia, not a good first choice drug Had regular visits, mammogram, and immunizations Was given home testing supplies Variable Extraction 15.071x Modeling the Expert: An Introduction to Logistic Regression 5

Creating the Dataset Variable Extraction Claims Sample Expert Review Expert Assessment Variable Extraction Dependent Variable Quality of care Independent Variables Diabetes treatment Patient demographics Healthcare utilization Providers Claims Prescriptions 15.071x Modeling the Expert: An Introduction to Logistic Regression 6

Predicting Quality of Care The dependent variable is modeled as a binary variable 1 if low-quality care, 0 if high-quality care This is a categorical variable A small number of possible outcomes Linear regression would predict a continuous outcome How can we extend the idea of linear regression to situations where the outcome variable is categorical? Only want to predict 1 or 0 Could round outcome to 0 or 1 But we can do better with logistic regression 15.071x Modeling the Expert: An Introduction to Logistic Regression 7

Logistic Regression Predicts the probability of poor care Denote dependent variable PoorCare by y h P (y = 1) Then ) P (y = 0) = 1 P (y = 1) Independent variables x 1,x 2,...,x k Uses the Logistic Response Function 1 P (y = 1) = 1+e ( 0+ 1 x 1 + 2 x 2 +...+ k x k ) Nonlinear transformation of linear regression equation to produce number between 0 and 1 15.071x Modeling the Expert: An Introduction to Logistic Regression 1

Understanding the Logistic Function P (y = 1) = 1 1+e ( 0+ 1 x 1 + 2 x 2 +...+ k x k ) Positive values are predictive of class 1 Negative values are predictive of class 0 P(y = 1) -4-2 0 2 4 15.071x Modeling the Expert: An Introduction to Logistic Regression 2

Understanding the Logistic Function P (y = 1) = 1 1+e ( 0+ 1 x 1 + 2 x 2 +...+ k x k ) The coefficients are selected to Predict a high probability for the poor care cases Predict a low probability for the good care cases 15.071x Modeling the Expert: An Introduction to Logistic Regression 3

Understanding the Logistic Function P (y = 1) = 1 1+e ( 0+ 1 x 1 + 2 x 2 +...+ k x k ) We can instead talk about Odds (like in gambling) Odds = P (y = 1) P (y = 0) Odds > 1 if y = 1 is more likely Odds < 1 if y = 0 is more likely 15.071x Modeling the Expert: An Introduction to Logistic Regression 4

The Logit It turns out that log(odds) = 0 + 1 x 1 + 2 x 2 +...+ k x k This is called the Logit and looks like linear regression The bigger the Logit is, the bigger P (y = 1) 15.071x Modeling the Expert: An Introduction to Logistic Regression 5

Model for Healthcare Quality Plot of the independent variables Number of Office Visits Number of Narcotics Prescribed Red are poor care Green are good care 15.071x Modeling the Expert: An Introduction to Logistic Regression 1

Threshold Value The outcome of a logistic regression model is a probability Often, we want to make a binary prediction Did this patient receive poor care or good care? We can do this using a threshold value t If P(PoorCare = 1) t, predict poor quality If P(PoorCare = 1) < t, predict good quality What value should we pick for t? 15.071x Modeling the Expert: An Introduction to Logistic Regression 1

Threshold Value Often selected based on which errors are better If t is large, predict poor care rarely (when P(y=1) is large) More errors where we say good care, but it is actually poor care Detects patients who are receiving the worst care If t is small, predict good care rarely (when P(y=1) is small) More errors where we say poor care, but it is actually good care Detects all patients who might be receiving poor care With no preference between the errors, select t = 0.5 Predicts the more likely outcome 15.071x Modeling the Expert: An Introduction to Logistic Regression 2

Selecting a Threshold Value Compare actual outcomes to predicted outcomes using a confusion matrix (classification matrix) Predicted = 0 Predicted = 1 Actual = 0 True Negatives (TN) False Positives (FP) Actual = 1 False Negatives (FN) True Positives (TP) 15.071x Modeling the Expert: An Introduction to Logistic Regression 3

Receiver Operator Characteristic (ROC) Curve True positive rate (sensitivity) on y-axis Proportion of poor care caught False positive rate (1-specificity) on x-axis Proportion of good care labeled as poor care True positive rate Receiver Operator Characteristic Curve False positive rate 15.071x Modeling the Expert: An Introduction to Logistic Regression 1

Selecting a Threshold using ROC Captures all thresholds simultaneously High threshold High specificity Low sensitivity Low Threshold Low specificity True positive rate Receiver Operator Characteristic Curve High sensitivity False positive rate 15.071x Modeling the Expert: An Introduction to Logistic Regression 2

Selecting a Threshold using ROC Choose best threshold for best trade off cost of failing to detect positives costs of raising false alarms True positive rate Receiver Operator Characteristic Curve False positive rate 15.071x Modeling the Expert: An Introduction to Logistic Regression 3

Selecting a Threshold using ROC Choose best threshold for best trade off cost of failing to detect positives costs of raising false alarms True positive rate Receiver Operator Characteristic Curve 0.07 0.25 0.44 0.63 0.81 1 False positive rate 15.071x Modeling the Expert: An Introduction to Logistic Regression 4

Selecting a Threshold using ROC Choose best threshold for best trade off cost of failing to detect positives costs of raising false alarms True positive rate Receiver Operator Characteristic Curve 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.2 0.1 0 0.07 0.25 0.44 0.63 0.81 1 False positive rate 15.071x Modeling the Expert: An Introduction to Logistic Regression 5

Interpreting the Model Multicollinearity could be a problem Do the coefficients make sense? Check correlations Measures of accuracy 15.071x Modeling the Expert: An Introduction to Logistic Regression 1

Compute Outcome Measures Confusion Matrix: Predicted Class = 0 Predicted Class = 1 Actual Class = 0 True Negatives (TN) False Positives (FP) Actual Class = 1 False Negatives (FN) True Positives (TP) N = number of observations Overall accuracy = ( TN + TP )/N Overall error rate = ( FP + FN)/N Sensitivity = TP/( TP + FN) False Negative Error Rate = FN/( TP + FN) Specificity = TN/( TN + FP) False Positive Error Rate = FP/( TN + FP) 15.071x Modeling the Expert: An Introduction to Logistic Regression 2

Making Predictions Just like in linear regression, we want to make predictions on a test set to compute out-of-sample metrics > predicttest = predict(qualitylog, type= response, newdata=qualitytest) This makes predictions for probabilities If we use a threshold value of 0.3, we get the following confusion matrix Predicted Good Care Predicted Poor Care Actually Good Care 19 5 Actually Poor Care 2 6 15.071x Modeling the Expert: An Introduction to Logistic Regression 3

Area Under the ROC Curve (AUC) Just take the area under the curve Interpretation Given a random positive and negative, proportion of the time you guess which is which correctly Less affected by sample balance than accuracy True positive rate Receiver Operator Characteristic Curve AUC = 0.775 False positive rate 15.071x Modeling the Expert: An Introduction to Logistic Regression 4

Area Under the ROC Curve (AUC) What is a good AUC? Maximum of 1 (perfect prediction) True positive rate Receiver Operator Characteristic Curve False positive rate 15.071x Modeling the Expert: An Introduction to Logistic Regression 5

Area Under the ROC Curve (AUC) What is a good AUC? Maximum of 1 (perfect prediction) Minimum of 0.5 (just guessing) True positive rate Receiver Operator Characteristic Curve False positive rate 15.071x Modeling the Expert: An Introduction to Logistic Regression 6

Conclusions An expert-trained model can accurately identify diabetics receiving low-quality care Out-of-sample accuracy of 78% Identifies most patients receiving poor care In practice, the probabilities returned by the logistic regression model can be used to prioritize patients for intervention Electronic medical records could be used in the future 15.071x Modeling the Expert: An Introduction to Logistic Regression 1

The Competitive Edge of Models While humans can accurately analyze small amounts of information, models allow larger scalability Models do not replace expert judgment Experts can improve and refine the model Models can integrate assessments of many experts into one final unbiased and unemotional prediction 15.071x Modeling the Expert: An Introduction to Logistic Regression 2