Efficient Use of EHR Data for Translational Research

Tianxi Cai
Department of Biostatistics, Harvard T.H. Chan School of Public Health
Department of Biomedical Informatics, Harvard Medical School
Harvard Catalyst, 2018

Acknowledgment
- Jessica Gronsbell (Stanford)
- Hong Chuan (Harvard)
- Sheng Yu (Tsinghua University)
- David Cheng (Harvard)
- Abhishek Chakrabortty (UPenn)
- Isaac Kohane (Harvard)
- Vivian Gainer (Partners Healthcare)
- Victor Castro (Partners Healthcare)
- Shawn Murphy (Partners Healthcare)
- Ashwin Ananthakrishnan (MGH)
- Katherine Liao (BWH)
- NIH BD2K grant
(Harvard Catalyst, 2018) EHR Research 2 / 18

Outline
- Opportunities & Challenges in using EHR for research
- Phenome-wide Association Study (PheWAS)
- Genetic Risk Prediction
- Comparative Effectiveness Research/Causal Inference
- Efficient Phenotyping via Semi-supervised Learning (SSL)
- SSL Approach to Genetic Risk Modeling
- Remarks

Background
- EHR adoption rate (figure; source: CDC NCHS Data Brief (2014))
- Rich resource for research
  - detailed longitudinal patient level data
  - a wide range of disease conditions
  - enables large scale genomic & comparative effectiveness studies

EHR Data
- structured data: ICD9 billing codes, lab results, etc.
- unstructured text data: extracted via natural language processing (NLP)
  - clinical terms mapped to concept unique identifiers (CUI)
[Liao et al, 2015]
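To make the term-to-concept step concrete, here is a minimal sketch of dictionary-based concept counting on one note. It is only an illustration: the dictionary `TERM_TO_CUI`, the function `count_concepts`, and the identifiers such as `CUI_RA` are hypothetical placeholders (not real UMLS CUIs), and a real clinical NLP pipeline would also handle negation, abbreviations in context, and misspellings.

```python
import re
from collections import Counter

# Hypothetical term-to-concept dictionary; the identifiers are placeholders,
# not real UMLS CUIs.
TERM_TO_CUI = {
    "rheumatoid arthritis": "CUI_RA",
    "ra": "CUI_RA",
    "morning stiffness": "CUI_STIFFNESS",
    "methotrexate": "CUI_MTX",
}

def count_concepts(note):
    """Count concept mentions in one free-text note (no negation handling)."""
    text = note.lower()
    counts = Counter()
    # Match longer terms first so a multi-word term is consumed as a whole.
    for term in sorted(TERM_TO_CUI, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[TERM_TO_CUI[term]] += len(re.findall(pattern, text))
        text = re.sub(pattern, " ", text)  # remove matched text before shorter terms
    return counts

note = ("Pt with rheumatoid arthritis, reports morning stiffness. "
        "RA flare; started methotrexate.")
concept_counts = count_concepts(note)
print(dict(concept_counts))
```

Summing such counts over all notes of a patient gives one NLP feature per concept, which can sit alongside the codified features in the same feature vector.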

Integrative Analysis of Electronic Medical Records (EMR) Data
- EHR linked with bio-repository
- [Diagram: EMR and bio-repository supporting PheWAS, Genomic Risk Prediction of Disease, Comparative Effectiveness Research, and Pharmacogenomics]
A Major Challenge
- Precise info on phenotype/treatment response not readily available
- ICD9 billing codes sometimes provide inaccurate approximations, leading to power loss
  - with PPV 0.70, NPV 0.95: power 45% vs 80% w/ gold standard labels
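The power loss from noisy phenotype labels can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the 45% vs 80% figures on the slide: the sample size, allele frequency, effect size `beta`, and the sensitivity/specificity of the noisy label are all hypothetical, and the test is a simple correlation-based trend test (z = r * sqrt(n)).

```python
import math
import random

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_power(n=1000, beta=0.35, sens=0.6, spec=0.9, n_rep=200, seed=1):
    """Estimate power of a marker-phenotype test with gold vs noisy labels.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    rejections = {"gold": 0, "noisy": 0}
    for _ in range(n_rep):
        g, y, y_noisy = [], [], []
        for _ in range(n):
            gi = int(rng.random() < 0.3) + int(rng.random() < 0.3)  # allele count
            pi = 1.0 / (1.0 + math.exp(-(-1.5 + beta * gi)))       # P(Y = 1 | G)
            yi = int(rng.random() < pi)                             # true phenotype
            p_flag = sens if yi == 1 else 1.0 - spec                # P(flagged | Y)
            g.append(gi)
            y.append(yi)
            y_noisy.append(int(rng.random() < p_flag))              # billing-code-like label
        for name, labels in (("gold", y), ("noisy", y_noisy)):
            z = correlation(g, labels) * math.sqrt(n)
            if abs(z) > 1.96:  # two-sided test at the 5% level
                rejections[name] += 1
    return {name: count / n_rep for name, count in rejections.items()}

power = simulate_power()
print(power)
```

Misclassification shrinks the observed marker-phenotype association toward zero, so the same cohort detects the same genetic effect less often when the noisy label replaces the chart-reviewed one.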

EHR Phenotyping
Challenge: Who has what disease phenotype/outcome?
Solution: build algorithms to predict phenotype
Algorithm Development Major Steps:
1. identify features (X) relevant to the phenotype
2. gold standard labels (Y) obtained via chart review
3. regression modeling $Y \sim g(X; \theta)$, giving $\hat{Y}(X) = g(X; \hat{\theta})$
4. prediction performance evaluation
5. apply the algorithm to the EMR to predict phenotype
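Steps 3 to 5 above can be sketched end to end on simulated data. This is a minimal illustration assuming a logistic working model for g and two hypothetical standardized features; the data generator `simulate_patients`, the sample sizes, and the 0.5 classification cutoff are assumptions, not the settings used in the talk.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(theta, x):
    """Predicted probability g(x; theta) under a logistic working model."""
    return sigmoid(theta[0] + sum(t * v for t, v in zip(theta[1:], x)))

def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Step 3: estimate theta by gradient ascent on the average log-likelihood."""
    theta = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            resid = yi - predict(theta, xi)
            grad[0] += resid
            for j, v in enumerate(xi):
                grad[j + 1] += resid * v
        theta = [t + lr * g / len(X) for t, g in zip(theta, grad)]
    return theta

def auc(scores, labels):
    """Step 4: chance that a random case outscores a random control (ties count 1/2)."""
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))

def simulate_patients(n, rng):
    """Hypothetical data: two standardized features shifted upward in cases."""
    X, y = [], []
    for _ in range(n):
        yi = int(rng.random() < 0.3)
        X.append([rng.gauss(1.5 * yi, 1.0), rng.gauss(1.0 * yi, 1.0)])
        y.append(yi)
    return X, y

rng = random.Random(0)
X_train, y_train = simulate_patients(300, rng)  # chart-reviewed training set
X_valid, y_valid = simulate_patients(300, rng)  # held-out labeled set
X_emr, _ = simulate_patients(2000, rng)         # unlabeled patients; labels discarded

theta_hat = fit_logistic(X_train, y_train)
valid_auc = auc([predict(theta_hat, x) for x in X_valid], y_valid)
predicted_cases = sum(predict(theta_hat, x) > 0.5 for x in X_emr)  # step 5
print(f"validation AUC = {valid_auc:.2f}, predicted cases = {predicted_cases}")
```

Evaluating on a held-out labeled set rather than the training set keeps the accuracy estimate from being optimistic, which matters when only a few hundred chart-reviewed labels are available.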

Rheumatoid Arthritis (RA) Algorithm Development
- Partners Healthcare EMR Data Mart (N = 29,432): at least 1 ICD9 code for RA or tested for anti-CCP
- Features (p ≈ 100): curated by domain experts
  - codified variables (e.g. ICD9 billing codes, lab test results, medication prescriptions)
  - NLP variables (e.g. NLP mentions of symptoms, diseases, medications)
- Training set: n = 500 (chart reviewed)
- Algorithm developed via regularized estimation: adaptive LASSO
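The slide names the adaptive LASSO without writing out the objective. A standard form is given below as an assumption about the setup rather than the authors' exact specification: $\ell_n$ denotes the logistic log-likelihood on the n = 500 labeled patients, $\tilde{\theta}$ an initial estimate (for example unpenalized or ridge), and $\lambda, \gamma > 0$ tuning parameters.

```latex
\hat{\theta} \;=\; \arg\min_{\theta}
\left\{ -\frac{1}{n}\,\ell_n(\theta)
      \;+\; \lambda \sum_{j=1}^{p} \frac{|\theta_j|}{|\tilde{\theta}_j|^{\gamma}} \right\}
```

Features with a small initial coefficient receive a heavy penalty and tend to be dropped, which suits a setting with roughly 100 candidate features and only 500 labels.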

RA Algorithm Development
- Partners EMR: AUC 0.95; PPV 0.94; virtual cohort size n = 4453 (≈ 15%) [Liao et al, 2010]
- Portability to other EMR: AUC 0.92 at Northwestern; 0.95 at Vanderbilt [Carroll et al, 2012]
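A PPV and a cohort size both depend on where the predicted probability is cut. The sketch below shows one common way to pick that cutoff, by fixing a target specificity on the labeled data; the function name `classify_at_specificity`, the target value, and the toy scores are hypothetical and not taken from the cited studies.

```python
import math

def classify_at_specificity(scores, labels, target_spec=0.95):
    """Choose a cutoff reaching the target specificity, then report PPV etc."""
    controls = sorted(s for s, l in zip(scores, labels) if l == 0)
    # Smallest control score such that at least target_spec of controls fall at or below it.
    k = max(math.ceil(target_spec * len(controls)) - 1, 0)
    threshold = controls[k]
    tp = sum(1 for s, l in zip(scores, labels) if s > threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s > threshold and l == 0)
    n_cases = sum(labels)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return {"threshold": threshold, "ppv": ppv,
            "sensitivity": tp / n_cases, "n_flagged": tp + fp}

# Toy labeled set: predicted probabilities and chart-review labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.9, 0.8, 0.7, 0.35]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
result = classify_at_specificity(scores, labels, target_spec=0.75)
print(result)
```

Applying the same threshold to every patient in the data mart yields the virtual cohort: the flagged patients, whose expected purity is the PPV estimated on the labeled set.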

Bottlenecks: Labor/Resource Intensive
Algorithm development: costly in time and resources
1. identifying features: manual creation w/ clinical + NLP experts
   Solution: unsupervised feature selection leveraging online knowledge sources [Yu et al, 2015, 2016]
2. gold standard label chart review: clinical expert
   Solution: semi-supervised learning to improve estimation efficiency and hence reduce # of labels needed

Identifying Features: Automation
Automated Feature Extraction for Phenotyping [Yu et al, 2015, 2016]
- [Diagram, pipeline stages: Term Detection, Concept Mapping, Drug Grouping, Frequency Control, Junk Filtering, RankCor Control]
- [Diagram, data flow labels: online knowledge sources, candidate features, surrogate phenotypes, data driven feature selection]
Results for RA classification with Partners EMR
- candidate features: rheumatoid arthritis, morning stiffness, methotrexate, TNF, CRP
- classification accuracy: AUC = 0.95
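The idea of screening candidate features against a surrogate phenotype by rank correlation can be sketched as follows. This is a simplified stand-in, not the published pipeline: the surrogate (here a per-patient count of the main diagnosis code), the cutoff `min_cor`, and the toy counts are all hypothetical.

```python
import math

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no rank information
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def screen_features(features, surrogate, min_cor=0.3):
    """Keep features whose rank correlation with the surrogate is large enough."""
    return sorted(name for name, column in features.items()
                  if abs(spearman(column, surrogate)) >= min_cor)

# Toy per-patient counts; no chart-review labels are used anywhere.
surrogate = [0, 1, 2, 3, 4, 5, 6, 7]  # e.g. count of the main diagnosis code
features = {
    "morning stiffness": [0, 0, 1, 1, 2, 3, 3, 4],
    "methotrexate": [0, 1, 0, 1, 1, 2, 2, 3],
    "headache": [1, 0, 1, 0, 1, 0, 1, 0],
}
selected = screen_features(features, surrogate)
print(selected)
```

Because the screen relies only on the surrogate, it can run on the full unlabeled cohort, leaving the scarce gold standard labels for model fitting and validation.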

Bottlenecks: Labor/Resource Intensive
Algorithm development: costly in time and resources
1. identifying features
2. gold standard label chart review: clinical expert
   Solution: semi-supervised learning to improve estimation efficiency and hence reduce # of labels needed

Semi-Supervised Setting: Nature of the Data
- Unlabeled data: feature distribution ($P_X$)
- Question: Can we use unlabeled data to get a more efficient SSL procedure?
- Missing data problem? % missing in the outcome: 100%

Semi-supervised Learning (SSL)
SSL procedure:
- Step I: learn a relationship between Y and X using labeled data (L)
- Step II: impute the missing Y for the unlabeled data (U)
- Step III: regress the imputed outcome against X
The SSL estimator can be substantially more efficient than the supervised estimator under certain scenarios.
A robust combination procedure guarantees that the SSL procedure will always be at least as efficient as the supervised one.
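The three steps can be sketched on a toy problem. The example below makes several simplifying assumptions that are not from the talk: a continuous outcome instead of a binary phenotype, a single feature, a quadratic imputation model in Step I, and a linear working model in Step III. It also omits the robust combination procedure mentioned on the slide.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def least_squares(rows, y):
    """Ordinary least squares via the normal equations."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

rng = random.Random(0)
# Hypothetical data: E[Y | X] = X + X^2, so the linear working model is misspecified.
x_lab = [rng.uniform(0, 2) for _ in range(100)]            # labeled set L
y_lab = [x + x * x + rng.gauss(0, 0.5) for x in x_lab]
x_unl = [rng.uniform(0, 2) for _ in range(5000)]           # unlabeled set U

# Step I: learn the Y-X relationship on L with a richer (quadratic) model.
gamma = least_squares([[1.0, x, x * x] for x in x_lab], y_lab)

# Step II: impute the outcome for every patient.
x_all = x_lab + x_unl
y_imputed = [gamma[0] + gamma[1] * x + gamma[2] * x * x for x in x_all]

# Step III: regress the imputed outcome on X using all patients.
theta_ssl = least_squares([[1.0, x] for x in x_all], y_imputed)

# Supervised benchmark: the same working model fitted on L only.
theta_sup = least_squares([[1.0, x] for x in x_lab], y_lab)
print("SSL slope:", round(theta_ssl[1], 3), "supervised slope:", round(theta_sup[1], 3))
```

In this toy setup the population value of the best linear slope is 3, and both estimators target it. The unlabeled data helps because it pins down the feature distribution, which matters for the target whenever the working model in Step III is not exactly right.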

Example: EHR Algorithm for Classifying Rheumatoid Arthritis
- n = 500 labeled observations
- N = 29,000 unlabeled observations
- Features: ICD9 codes of rheumatoid arthritis and competing diagnoses, NLP mentions of clinical conditions/signs/symptoms, medication prescriptions, lab results
Results: Efficiency of SSL relative to supervised
- SSL is 20% to 380% more efficient for regression coefficients
- SSL is 50% to 670% more efficient for accuracy parameters such as sensitivity, specificity and AUC

Genetic Risk Prediction
Goal: predict the risk of Y = 1 using genetic marker G under $P(Y = 1 \mid G) = g(\beta_0 + \beta^{T} G)$
Challenges:
- Y only available on a small set
- Algorithm scores $S = (S_1, \ldots, S_k)^{T}$ for predicting Y are available on all patients, but they may not be entirely accurate or fully validated
Question: how to efficiently
- estimate $\beta$
- evaluate the prediction performance of $S_k$ for Y
Approach:
- Assumption: S relates to G only through Y, i.e. $S \perp G \mid Y$
- maximizing a composite non-parametric likelihood over $P(S_k \le s \mid Y)$ and $\beta$
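It may help to spell out why the conditional independence assumption lets unlabeled patients inform $\beta$. The derivation below follows directly from $S \perp G \mid Y$ and the risk model on the slide; the shorthand $F_{ky}(s) = P(S_k \le s \mid Y = y)$ is introduced here for illustration, and how the terms are combined across scores into the composite likelihood is not specified on the slide.

```latex
\begin{aligned}
P(S_k \le s \mid G)
  &= \sum_{y \in \{0,1\}} P(S_k \le s \mid Y = y, G)\, P(Y = y \mid G) \\
  &= \sum_{y \in \{0,1\}} P(S_k \le s \mid Y = y)\, P(Y = y \mid G)
     \qquad \text{(since } S \perp G \mid Y\text{)} \\
  &= F_{k1}(s)\, g(\beta_0 + \beta^{T} G)
     + F_{k0}(s)\,\bigl\{ 1 - g(\beta_0 + \beta^{T} G) \bigr\}
\end{aligned}
```

The distribution of each score given G is therefore a two-component mixture whose mixing weight is the genetic risk itself, so patients without a label on Y still carry information about $\beta$ through their observed (S, G).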

Genetic Risk Prediction of CAD in RA Patients
Goal: compare genetic risk model estimation for CAD among RA patients via supervised methods with a full sample of 950 patients versus SSL with a subsample of 200 labels on CAD, leveraging three phenotype algorithms $S = (S_{ICD}, S_{NLP}, S_{Curated})^{T}$.

Regression coefficients:

              Full Data               n = 200 Labels
Variable      β (SE)        p-value   β (SE)        p-value
age           0.09 (0.01)   0.00      0.11 (0.01)   0.00
sex           1.21 (0.27)   0.00      1.51 (0.31)   0.00
rs2479409     0.45 (0.26)   0.08      0.56 (0.30)   0.06
rs7206971     0.82 (0.36)   0.02      0.75 (0.37)   0.04
rs2902940     2.02 (0.71)   0.00      1.96 (0.55)   0.00

Accuracy of the phenotype algorithms, estimate (SE):

                 AUC                                    Sensitivity
Labels  Method   ICD         NLP         ICD+NLP       ICD         NLP         ICD+NLP
200     SL       .93 (.071)  .98 (.020)  .99 (.012)    .84 (.142)  .80 (.161)  .92 (.093)
200     SSL      .97 (.017)  .98 (.005)  .99 (.002)    .86 (.062)  .86 (.068)  .97 (.017)
Full    SL       .94 (.015)  .98 (.004)  .99 (.002)    .86 (.041)  .85 (.039)  .97 (.013)
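The gain from SSL in the accuracy results can be summarized as a relative efficiency, the ratio of variances of the supervised and semi-supervised estimators. The short calculation below uses the values in parentheses from the slide for the two fits with 200 labels, read here as standard errors; that reading and the variance-ratio definition are assumptions for illustration.

```python
# Values in parentheses from the accuracy results above, 200 labels,
# read as standard errors for the supervised (SL) and semi-supervised (SSL) fits.
se_sl = {
    "AUC": {"ICD": 0.071, "NLP": 0.020, "ICD+NLP": 0.012},
    "Sensitivity": {"ICD": 0.142, "NLP": 0.161, "ICD+NLP": 0.093},
}
se_ssl = {
    "AUC": {"ICD": 0.017, "NLP": 0.005, "ICD+NLP": 0.002},
    "Sensitivity": {"ICD": 0.062, "NLP": 0.068, "ICD+NLP": 0.017},
}

def relative_efficiency(se_supervised, se_semi):
    """Variance ratio; values above 1 favor the semi-supervised estimator."""
    return (se_supervised / se_semi) ** 2

rel_eff = {
    metric: {alg: relative_efficiency(se_sl[metric][alg], se_ssl[metric][alg])
             for alg in se_sl[metric]}
    for metric in se_sl
}
for metric, by_algorithm in rel_eff.items():
    for alg, value in by_algorithm.items():
        print(f"{metric:12s} {alg:8s} relative efficiency = {value:.1f}")
```

Under this reading, every ratio exceeds 1, and the SSL standard errors with 200 labels are close to those of the supervised fit on the full sample.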

Remarks
EHR Data provides:
- Opportunities for novel big-data-analytics development
  - optimal sampling design for chart review or marker measurement
  - Unsupervised learning: Automated Feature Selection, Automated Phenotype Prediction/Annotation
- Opportunities to improve clinical practice and discovery research
  - precision medicine: who should be treated by what
  - more accurate diagnosis/prognosis, capturing the disease early
  - longitudinal information enables dynamic prediction