Efficient Use of EHR Data for Translational Research
Tianxi Cai
Department of Biostatistics, Harvard T.H. Chan School of Public Health
Department of Biomedical Informatics, Harvard Medical School
Harvard Catalyst, 2018
Acknowledgment
- Jessica Gronsbell (Stanford)
- Hong Chuan (Harvard)
- Sheng Yu (Tsinghua University)
- David Cheng (Harvard)
- Abhishek Chakrabortty (UPenn)
- Isaac Kohane (Harvard)
- Vivian Gainer (Partners HealthCare)
- Victor Castro (Partners HealthCare)
- Shawn Murphy (Partners HealthCare)
- Ashwin Ananthakrishnan (MGH)
- Katherine Liao (BWH)
- NIH BD2K grant
(Harvard Catalyst, 2018) EHR Research 2 / 18
Outline
- Opportunities & Challenges in using EHR for research
  - Phenome-wide Association Study (PheWAS)
  - Genetic Risk Prediction
  - Comparative Effectiveness Research/Causal Inference
- Efficient Phenotyping via Semi-supervised Learning (SSL)
- SSL Approach to Genetic Risk Modeling
- Remarks
Background
- EHR adoption rate [figure; source: CDC NCHS Data Brief (2014)]
- Rich resource for research
  - detailed longitudinal patient-level data
  - a wide range of disease conditions
  - enables large-scale genomic and comparative effectiveness studies
EHR Data
- Structured data: ICD9 billing codes, lab results, etc.
- Unstructured text data: extracted via natural language processing (NLP)
  - clinical terms mapped to concept unique identifiers (CUI)
[Liao et al, 2015]
Integrative Analysis of Electronic Medical Records (EMR) Data
- EHR linked with bio-repository
- EMR plus bio-repository support: PheWAS; genomic risk prediction of disease; comparative effectiveness research; pharmacogenomics
A Major Challenge
- Precise information on phenotype/treatment response is not readily available
- ICD9 billing codes sometimes provide inaccurate approximations, which leads to power loss
- Example: with PPV 0.70 and NPV 0.95, power is 45% versus 80% with gold standard labels
EHR Phenotyping
- Challenge: who has what disease phenotype/outcome?
- Solution: build algorithms to predict phenotype
Algorithm Development: Major Steps
1. identify features (Z) relevant to the phenotype
2. obtain gold standard labels (Y) via chart review
3. regression modeling: $Y \sim g(x; \theta)$, giving $\hat{Y}(x) = g(x; \hat{\theta})$
4. prediction performance evaluation
5. apply the algorithm to the EMR to predict phenotype
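The five steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm used in the talk: it assumes g is the logistic link, fits θ by plain unpenalized gradient descent, and uses a single toy feature. The names `fit_logistic` and `predict_proba` and all data values are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic link, assumed here as the choice of g."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(theta, x):
    """Predicted P(Y = 1 | x) = g(theta_0 + theta_1 * x_1 + ...)."""
    return sigmoid(theta[0] + sum(t * v for t, v in zip(theta[1:], x)))


def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Step 3: estimate theta on the chart-reviewed patients by gradient descent."""
    n, p = len(X), len(X[0])
    theta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            resid = predict_proba(theta, xi) - yi
            grad[0] += resid
            for j, v in enumerate(xi):
                grad[j + 1] += resid * v
        # Move against the average gradient of the logistic loss.
        theta = [t - lr * g / n for t, g in zip(theta, grad)]
    return theta


# Steps 1-2 (toy stand-in): one feature per patient with chart-review labels.
X_labeled = [[0], [1], [2], [3], [4], [5]]
y_labeled = [0, 0, 1, 0, 1, 1]

theta_hat = fit_logistic(X_labeled, y_labeled)

# Step 5: apply the fitted algorithm to patients without labels.
X_unlabeled = [[0], [5]]
phenotype_prob = [predict_proba(theta_hat, x) for x in X_unlabeled]
```

Step 4 (performance evaluation) is omitted here; in practice it would be done on labeled patients held out from fitting.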
Rheumatoid Arthritis (RA) Algorithm Development
- Partners HealthCare EMR Data Mart (N = 29,432): patients with at least 1 ICD9 code for RA or tested for anti-CCP
- Features (p 100): curated by domain experts
  - codified variables (e.g. ICD9 billing codes, lab test results, medication prescriptions)
  - NLP variables (e.g. NLP mentions of symptoms, diseases, medications)
- Training set: n = 500 (chart reviewed)
- Algorithm developed via regularized estimation: adaptive LASSO
RA Algorithm Development
- Partners EMR: AUC 0.95; PPV 0.94; virtual cohort size n = 4453 (≈ 15%) [Liao et al, 2010]
- Portability to other EMRs: AUC 0.92 at Northwestern; 0.95 at Vanderbilt [Carroll et al, 2012]
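For readers less familiar with the two accuracy measures quoted above, the following sketch shows how AUC and PPV could be computed from algorithm scores and chart-review labels. The function names, the threshold, and the four-patient data set are hypothetical and chosen only for illustration.

```python
def auc(scores, labels):
    """Chance that a random case outscores a random non-case (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ppv(scores, labels, threshold):
    """Share of true cases among patients called cases (score >= threshold).

    Assumes at least one patient is called a case.
    """
    called = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(called) / len(called)


# Hypothetical algorithm scores and chart-review labels for four patients.
scores = [0.10, 0.40, 0.35, 0.80]
labels = [0, 0, 1, 1]

auc_value = auc(scores, labels)                  # 3 of 4 case/non-case pairs ordered correctly
ppv_value = ppv(scores, labels, threshold=0.30)  # 2 true cases among 3 called
```

The pairwise AUC is quadratic in the number of patients, which is fine for a few hundred labels; a rank-based formula would be preferable for larger validation sets.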
Bottlenecks: Labor/Resource Intensive
Algorithm development is costly in time and resources:
1. Identifying features: manual creation with clinical and NLP experts
   - Solution: unsupervised feature selection leveraging online knowledge sources [Yu et al, 2015, 2016]
2. Gold standard labels: chart review by clinical experts
   - Solution: semi-supervised learning to improve estimation efficiency and hence reduce the number of labels needed
Identifying Features: Automation
- Automated Feature Extraction for Phenotyping [Yu et al, 2015, 2016]
  - [diagram; components: Term Detection, Concept Mapping, Drug Grouping, Frequency Control, Junk Filtering, RankCor Control]
  - online knowledge sources supply candidate features
  - surrogate phenotypes drive data-driven feature selection
- Results for RA classification with the Partners EMR
  - candidate features: rheumatoid arthritis, morning stiffness, methotrexate, TNF, CRP
  - classification accuracy: AUC = 0.95
Bottlenecks: Labor/Resource Intensive
Algorithm development is costly in time and resources:
1. Identifying features
2. Gold standard labels: chart review by clinical experts
   - Solution: semi-supervised learning to improve estimation efficiency and hence reduce the number of labels needed
Semi-Supervised Setting: Nature of the Data
- Unlabeled data: feature distribution ($P_X$)
- Question: can we use unlabeled data to get a more efficient SSL procedure?
- Missing data problem? % missing in the outcome: 100%
Semi-supervised Learning (SSL)
SSL procedure:
- Step I: learn a relationship between Y and X using the labeled data (L)
- Step II: impute the missing Y for the unlabeled data (U)
- Step III: regress the imputed outcome against X
Properties:
- The SSL estimator can be substantially more efficient than the supervised estimator under certain scenarios
- A robust combination procedure guarantees that the SSL procedure is always at least as efficient as the supervised one
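The three steps can be illustrated with a deliberately simplified sketch. It is not the estimator from the talk: it assumes a continuous outcome, a linear working model, and a quadratic imputation basis, and it leaves out both the bias correction a real SSL estimator would need and the robust combination step. All function names (`solve`, `ols`, `basis`, `ssl_fit`) and the toy data are hypothetical.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def ols(rows, y):
    """Least-squares coefficients via the normal equations."""
    p = len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    return solve(XtX, Xty)


def basis(x):
    """Richer basis for the imputation model (an assumption of this sketch)."""
    return [1.0, x, x * x]


def ssl_fit(x_lab, y_lab, x_unl):
    # Step I: learn the relationship between Y and X on the labeled data L.
    gamma = ols([basis(x) for x in x_lab], y_lab)
    # Step II: impute the missing Y for the unlabeled data U.
    y_imp = [sum(g * b for g, b in zip(gamma, basis(x))) for x in x_unl]
    # Step III: regress the imputed outcome on X (working model: intercept + slope).
    return ols([[1.0, x] for x in x_unl], y_imp)


# Toy data: a handful of labels, many more unlabeled feature values.
x_lab = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
y_lab = [1.0 + 2.0 * x for x in x_lab]   # noise-free labels, for illustration only
x_unl = [i / 100 for i in range(101)]

intercept, slope = ssl_fit(x_lab, y_lab, x_unl)
```

The imputation model is intentionally richer than the working model in Step III; when both are the same correctly specified model, the unlabeled data add nothing, which is consistent with the slide's caveat that gains arise "under certain scenarios".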
Example: EHR Algorithm for Classifying Rheumatoid Arthritis
- n = 500 labeled observations
- N = 29,000 unlabeled observations
- Features: ICD9 codes for rheumatoid arthritis and competing diagnoses; NLP mentions of clinical conditions/signs/symptoms; medication prescriptions; lab results
Results: efficiency of SSL relative to supervised estimation
- SSL is 20% to 380% more efficient for regression coefficients
- SSL is 50% to 670% more efficient for accuracy parameters such as sensitivity, specificity and AUC
Genetic Risk Prediction
- Goal: predict the risk of Y = 1 using genetic markers G under $P(Y = 1 \mid G) = g(\beta_0 + \beta^{T} G)$
- Challenges:
  - Y is only available on a small set
  - algorithm scores $S = (S_1, \ldots, S_k)^{T}$ for predicting Y are available on all patients, but they may not be entirely accurate or fully validated
- Question: how to
  - efficiently estimate $\beta$
  - evaluate the prediction performance of $S_k$ for Y
- Approach:
  - Assumption: S relates to G only through Y, i.e. $S \perp G \mid Y$
  - maximize a composite non-parametric likelihood to estimate $P(S_k \le s \mid Y)$ and $\beta$
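One way to see why the conditional independence assumption lets the scores carry information about β: the distribution of each score given G becomes a two-component mixture whose weights are the genetic risk model. The display below follows from the law of total probability and the stated assumption only; the shorthand $F_{k,y}$ is introduced here for illustration, and the exact composite likelihood used in the talk is not spelled out on the slide.

```latex
\begin{aligned}
P(S_k \le s \mid G)
  &= \sum_{y \in \{0,1\}} P(S_k \le s \mid Y = y, G)\, P(Y = y \mid G) \\
  &= F_{k,1}(s)\, g(\beta_0 + \beta^{T} G)
   + F_{k,0}(s)\, \bigl\{ 1 - g(\beta_0 + \beta^{T} G) \bigr\},
\qquad F_{k,y}(s) = P(S_k \le s \mid Y = y).
\end{aligned}
```

The second line uses the assumption that S is independent of G given Y to drop G from the first factor, so patients without a label on Y still contribute to estimating β through their scores.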
Genetic Risk Prediction of CAD in RA Patients
Goal: compare genetic risk model estimation for CAD among RA patients via supervised methods with the full sample of 950 patients versus SSL with a subsample of 200 labels on CAD, leveraging three phenotype algorithms $S = (S_{ICD}, S_{NLP}, S_{Curated})^{T}$.

Regression coefficients
                Full Data               n = 200 Labels
Variable        β (se)        p-value   β (se)        p-value
age             0.09 (0.01)   0.00      0.11 (0.01)   0.00
sex             1.21 (0.27)   0.00      1.51 (0.31)   0.00
rs2479409       0.45 (0.26)   0.08      0.56 (0.30)   0.06
rs7206971       0.82 (0.36)   0.02      0.75 (0.37)   0.04
rs2902940       2.02 (0.71)   0.00      1.96 (0.55)   0.00

Accuracy of the phenotype algorithms, estimate (se)
                AUC                                    Sensitivity
Method          ICD         NLP         ICD+NLP        ICD         NLP         ICD+NLP
SL, n = 200     .93 (.071)  .98 (.020)  .99 (.012)     .84 (.142)  .80 (.161)  .92 (.093)
SSL, n = 200    .97 (.017)  .98 (.005)  .99 (.002)     .86 (.062)  .86 (.068)  .97 (.017)
SL, full data   .94 (.015)  .98 (.004)  .99 (.002)     .86 (.041)  .85 (.039)  .97 (.013)
Remarks
EHR data provide:
- Opportunities for novel big-data-analytics development
  - optimal sampling design for chart review or marker measurement
  - unsupervised learning: automated feature selection; automated phenotype prediction/annotation
- Opportunities to improve clinical practice and discovery research
  - precision medicine: who should be treated by what
  - more accurate diagnosis/prognosis
  - capture the disease early
  - longitudinal information enables dynamic prediction