Predictive and Causal Modeling in the Health Sciences Sisi Ma MS, MS, PhD. New York University, Center for Health Informatics and Bioinformatics 1
Exponentially Rapid Data Accumulation. Timeline: 1975 Rapid DNA Sequencing; 1982 GenBank Formed; 1986 Protein Sequencing via MS; 1990 Human Genome Project Initiated; 2003 Completion of Human Genome Sequencing; 2005 First GWAS Study Published, NGS; 2006 TCGA Initiated; 2010 Human Connectome Project; 2012 Single Cell Sequencing; 2016 TCGA Completed (>10,000 Tumors). Undated: PDB Initiated; 1,000 Genomes Initiated. 2
From Data to Discoveries Advanced data preparation, analysis, and modeling methods are needed for knowledge discovery in high-volume, high-variety data. Two key types: Predictive Modeling and Computational Causal Discovery. Predictive Model: Predictive Knowledge (Screening, Diagnostics, Prognostics). Causal Model: Causal Knowledge (Intervention, Therapeutics). 3
Talk Outline Predictive Modeling o Brief Introduction to Predictive Modeling o Indicative Case Studies Causal Modeling o Causal Modeling using Observational Data o Indicative Case Studies o Causal Modeling-Guided Experimental Minimization and Adaptive Data Collection 4
Predictive Models : the Goal 6
Example of Predictive Modeling: Support Vector Machines (SVMs) Key Characteristics of SVM: Maximum margin (gap) between classes to reduce overfitting. The resulting QP problem can be solved with standard methods. Soft margins to tolerate noise. Kernel trick for data that are not linearly separable. Boser et al., 1992; Statnikov et al., 2011 7
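For reference, the soft-margin idea listed on this slide is usually written as the quadratic program below. The notation ($w$, $b$, $\xi_i$, $C$, $\phi$) follows the common textbook convention and is an assumption here, not something taken from the slide.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad & y_i\,\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,
\qquad \xi_i \ge 0, \quad i = 1,\dots,n
\end{aligned}
```

Minimizing $\lVert w\rVert$ maximizes the margin $2/\lVert w\rVert$, the slack variables $\xi_i$ are the soft margin that tolerates noisy samples, and the kernel trick replaces inner products $\phi(x_i)^{\top}\phi(x_j)$ with a kernel $K(x_i, x_j)$ in the dual problem.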
Predictive Models : the Goal 8
Predictive Modeling: a Simplified General Framework 9
Predictive Modeling: Cross validation for performance estimation and model selection Ma et al., 2015 (in preparation) 10
Predictive Modeling for Post-traumatic Stress Post-traumatic Stress Response: Almost everyone experiences at least one traumatic event in their life. Most people display acute stress responses. Acute stress responses diminish over time in most individuals, but about 10% to 20% of people experience non-remitting stress responses long after the trauma. Persistent stress is detrimental to the physiological and psychological well-being of individuals. Galatzer-Levy et al., 2015; Ma et al., 2015; Galatzer-Levy et al., 2015 (submitted) 12
Predictive Modeling for Post-traumatic Stress Discovery Goals/Questions: Can we identify the people who will suffer from non-remitting stress responses? If so, can they be identified early enough? What types of data need to be collected to identify people who will suffer from non-remitting stress responses? 13
Predictive Modeling for Post-traumatic Stress Data: 166 trauma survivors who were admitted to the ER were followed for up to 4 months after the trauma. Patient history, clinical data, stress hormones, and psychiatric measurements were collected in the ER and at 1 week, 1 month, and 4 months after the trauma. A total of 135 variables were collected. 14
Predictive Modeling for Post-traumatic Stress Remitting and Non-remitting Post-traumatic Stress Responses (Identified via Latent Growth Mixture Modeling) 15
Predictive Model for Post-traumatic Stress Study Design: Five predictive models were built using data incorporating increasing amounts of information: (1) background data; (2) data collected through the ER; (3) data collected through 1 week; (4) data collected through 1 month; (5) data collected through 4 months. SVM with feature selection was employed, with 5-fold cross-validation repeated over 10 splits. 17
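The repeated cross-validation design on this slide can be sketched as follows. This is a minimal, standard-library illustration of generating 10 repeats of stratified 5-fold splits; the function name, the stratification, and the toy labels are assumptions for illustration and do not come from the study's own pipeline.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated stratified k-fold CV."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so every fold keeps the class ratio.
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield repeat, fold, train_idx, sorted(test_idx)


# Toy labels (hypothetical): 5 non-remitting (1) and 15 remitting (0) subjects.
labels = [1] * 5 + [0] * 15
splits = list(repeated_stratified_kfold(labels))
```

In a design like the one described, feature selection and SVM fitting would presumably be run on `train_idx` only within each split, so that the held-out fold gives an unbiased performance estimate.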
Predictive Modeling for Post-traumatic Stress Prediction accuracy increases progressively as data collected at later time points are added to the predictive models. The predictivity of the model built with patient background information alone is statistically significant. Models built with patient background information and data collected in the ER have strong enough predictive performance to be clinically useful. 18
Predictive Modeling for Post-traumatic Stress Discovery Goals/Questions: Can we identify the people who will suffer from non-remitting stress responses? If so, can they be identified early enough? What types of data need to be collected to identify people who will suffer from non-remitting stress responses? Specifically, can neuroendocrine levels predict non-remitting post-traumatic stress? 19
Predictive Modeling for Post-traumatic Stress The neuroendocrine data studied contain limited information about non-remitting stress responses. Except at the ER time point, combining neuroendocrine data with other data (clinical information, psychiatric surveys) does not significantly increase the predictivity of the models. 20
Other Case Studies for Predictive Modeling: Predicting Cancer Patient Outcome; Predicting Neural Activity in the Dorsolateral Striatum; Predicting Transposon Insertion 21
Other Case Studies for Predictive Modeling Predicting Cancer Patient Outcome Problem: Determine the most informative data modality for predicting cancer patient outcome. Data: 47 datasets/predictive tasks that in total span 9 data modalities: copy number, gene expression, protein expression, microRNA expression, imaging, GWAS, somatic mutation, methylation, and clinical information. Conclusion: Gene expression is generally the most informative data modality. Combining different data modalities does not increase predictive performance. Ray MS, Henaff MS, Aliferis PhD, Statnikov PhD @ NYU. Ray et al., 2014 22
Other Case Studies for Predictive Modeling Predicting Neural Activity in the Dorsolateral Striatum (DLS) Problem: Predict neural activity from movement data. Data: Single-neuron activity in the DLS; head movement tracking data. Model: Linear-Nonlinear-Poisson model to predict neural activity from the head movement profile of the animal and the spike history of the neuron. Result: Neural activity was reconstructed in a subpopulation of the neurons. David Barker PhD @ NIDA. Ma and Barker, 2014 23
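A Linear-Nonlinear-Poisson model of the kind named on this slide is commonly written as below. The symbols are assumptions for illustration: $\mathbf{x}(t)$ is the head movement profile, $\mathbf{y}_{\text{hist}}(t)$ the neuron's recent spike history, $\mathbf{k}$ and $\mathbf{h}$ the corresponding linear filters, $b$ a baseline term, $f$ a nonnegative nonlinearity (often the exponential), and $\Delta$ the time-bin width. The slide does not state which nonlinearity or filter lengths were used.

```latex
\lambda(t) = f\!\bigl(\mathbf{k}^{\top}\mathbf{x}(t) + \mathbf{h}^{\top}\mathbf{y}_{\text{hist}}(t) + b\bigr),
\qquad
y(t) \mid \lambda(t) \sim \operatorname{Poisson}\bigl(\lambda(t)\,\Delta\bigr)
```

The linear stage combines the movement and spike-history inputs, the nonlinear stage maps the result to a firing rate $\lambda(t)$, and the Poisson stage generates the spike count $y(t)$ in each bin.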
Other Case Studies for Predictive Modeling Predicting Transposon Insertion Problem: Identify transposon insertion locations in the genome. Data: Targeted sequencing data. Model: Train a logistic regression model on a set of annotated transposon insertion sites and apply it to de novo insertion identification. More than 95% of the de novo insertions identified by the model were validated by experiments. Zuojian Tang MS, David Fenyo PhD, Jeff Boeke PhD @ NYU Langone; Kathleen Burns @ JHU 24
Causal Modeling: the Goal 26
Causal Modeling: the Goal 27
Causal Modeling: Causal graphs Capture Direct, Indirect Relationships 28
Causal Modeling: V-structures a Common Technique for Orienting Causal Relationships 29
Causal Modeling: PC Algorithm, a prototypical causal discovery algorithm PC algorithm: Skeleton Discovery Spirtes et al., 1993 30
Causal Modeling: PC Algorithm PC algorithm: Skeleton Discovery, Trace 31
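The skeleton phase of the PC algorithm can be sketched as follows. This is a minimal illustration that assumes a perfect conditional independence oracle; here the oracle is d-separation in a known toy DAG (A → C ← B, C → D), whereas a real analysis would use a statistical test on data. The function names and the toy graph are hypothetical and not taken from the slides.

```python
from itertools import combinations


def d_separated(parents, x, y, cond):
    """Oracle: is x d-separated from y given cond in the DAG described by `parents`?"""
    # 1. Restrict to the ancestral set of x, y and the conditioning variables.
    relevant, stack = set(), [x, y, *cond]
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents[node])
    # 2. Moralize: connect each node to its parents and marry co-parents.
    nbrs = {n: set() for n in relevant}
    for child in relevant:
        for p in parents[child]:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p, q in combinations(parents[child], 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # 3. x and y are separated if every path between them hits cond.
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for n in nbrs[node] - seen - set(cond):
            seen.add(n)
            stack.append(n)
    return True


def pc_skeleton(variables, independent):
    """Start fully connected; drop x-y once some conditioning set separates them."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    for size in range(len(variables) - 1):  # conditioning set sizes 0, 1, 2, ...
        for x in variables:
            for y in sorted(adj[x]):
                candidates = sorted(adj[x] - {y})
                for cond in combinations(candidates, size):
                    if independent(x, y, frozenset(cond)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(cond)
                        break
    return adj, sepset


# Hypothetical toy DAG: A -> C <- B, C -> D (each node maps to its parents).
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
adj, sepset = pc_skeleton(
    ["A", "B", "C", "D"],
    lambda x, y, cond: d_separated(parents, x, y, cond),
)
```

The separating sets recorded in `sepset` are what the orientation phase later uses to detect v-structures.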
Causal Modeling: PC Algorithm PC algorithm: Orientation 32
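The v-structure step of the orientation phase can be sketched as below: for every unshielded triple X - Z - Y (X and Y not adjacent), orient X → Z ← Y when Z is not in the set that separated X from Y. The skeleton and separating sets are a hypothetical toy input, and the further edge-propagation rules that PC applies after this step are not shown.

```python
from itertools import combinations


def orient_v_structures(adj, sepset):
    """Return directed edges (parent, child) implied by unshielded colliders."""
    directed = set()
    for z in adj:
        for x, y in combinations(sorted(adj[z]), 2):
            if y in adj[x]:
                continue  # x and y are adjacent, so the triple is shielded
            if z not in sepset[frozenset((x, y))]:
                # z was not needed to separate x and y, so z must be a collider.
                directed.add((x, z))
                directed.add((y, z))
    return directed


# Hypothetical skeleton and separating sets for the toy graph A - C - B, C - D.
adj = {"A": {"C"}, "B": {"C"}, "C": {"A", "B", "D"}, "D": {"C"}}
sepset = {
    frozenset(("A", "B")): set(),
    frozenset(("A", "D")): {"C"},
    frozenset(("B", "D")): {"C"},
}
arrows = orient_v_structures(adj, sepset)
```

In this toy input only A → C ← B is oriented, because C appears in the separating sets of the other two unshielded triples.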
Causal Modeling: HITON-PC Algorithm (example graph with nodes A, B, C, D, E, T) Local causal discovery method. Easily extended for global causal discovery with the LGL framework. Aliferis et al., 2010 33
Causal Modeling: HITON-PC Algorithm Trace of HITON-PC (example graph with nodes A, B, C, D, E, T) 34
Causal Modeling: Semi-Interleaved HITON-PC, a more efficient implementation Efficient and robust. Scalable to very large data sets. Easily extended for global causal discovery with the LGL framework. An instantiation of the GLL framework. 35
Causal Modeling for Post-traumatic Stress Study Data: 166 trauma survivors who were admitted to the ER were followed for up to 4 months after the trauma. Patient history, clinical data, stress hormones, and psychiatric measurements were collected in the ER and at 1 week, 1 month, and 4 months after the trauma. A total of 135 variables were collected. Galatzer-Levy et al., 2015; Ma et al., 2015; Galatzer-Levy et al., 2015 (submitted) 37
Causal Model for Post-traumatic Stress Causal Discovery Question: What are the factors determining non-remitting stress responses? Analysis Design: A local causal discovery algorithm (HITON-PC) was applied to find the parents-and-children set of every measured variable. A global causal graph depicting the relationships among all measured variables was constructed using the local-to-global framework (LGL). Edges were oriented according to the time at which individual variables were measured. 38
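Orienting edges by measurement time, as described in the analysis design, can be sketched as follows: an edge is directed from the variable measured earlier to the one measured later, and edges between variables from the same time point stay undirected. The variable names and time codes below are hypothetical placeholders, not variables from the study.

```python
def orient_by_time(edges, measured_at):
    """Direct each edge from the earlier-measured variable to the later one."""
    directed, undirected = [], []
    for a, b in edges:
        if measured_at[a] < measured_at[b]:
            directed.append((a, b))
        elif measured_at[b] < measured_at[a]:
            directed.append((b, a))
        else:
            undirected.append((a, b))  # same time point: time gives no direction
    return directed, undirected


# Hypothetical time codes: 0 = background, 1 = ER, 2 = 1 week.
measured_at = {"age": 0, "er_pain": 1, "week1_sleep": 2, "week1_mood": 2}
edges = [
    ("er_pain", "age"),
    ("week1_sleep", "er_pain"),
    ("week1_sleep", "week1_mood"),
]
directed, undirected = orient_by_time(edges, measured_at)
```

This relies on the assumption that a later measurement cannot cause an earlier one; it says nothing about edges within a single time point.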
Causal Modeling for Post-traumatic Stress The Global Causal Graph A very complicated model! 39
Causal Modeling for Post-traumatic Stress Example Causal Path Leading to non-remitting Stress Responses 40
Causal Modeling for Post-traumatic Stress Potential intervention for non-remitting Stress Responses 41
Causal Modeling for Post-traumatic Stress Potential Intervention for non-remitting Stress Responses 42
Causal Model Guided Experimental Minimization and Adaptive Data Collection Goals: Reduce the number of experiments that experimentalists need to do in order to fully resolve a biological pathway (or other complex set of causal interactions among variables of interest). Reduce time to discovery. Reduce costs. 44
Causal Model Guided Experimental Minimization and Adaptive Data Collection Special Importance in Health Sciences with both omics data and clinical data: One variable can be univariately associated with hundreds to thousands of variables: drivers (direct and indirect), passengers, and effects. High degree of multiplicity. Classical statistical techniques exhibit both increased false positives and increased false negatives. 45
Causal Model-Guided Experimental Minimization and Adaptive Data Collection Simplified view of the Framework: 46
Causal Model Guided Experimental Minimization and Adaptive Data Collection The ODLP Algorithm: Output: Local causal pathway (parents and children) of the variable of interest. Two Phases: Identify local causal pathway consistent with the data and information equivalent clusters. Adaptively recommend experiments to perform, integrate experimental results to refine and orient the local causal pathway. Statnikov et al., 2015 (Accepted) 47
Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP: Pseudo Code 48
Causal Model Guided Experimental Minimization and Adaptive Data Collection The ODLP Algorithm Phase I: Identify the local causal pathway consistent with the data and the information-equivalent clusters (TIE*, iTIE* algorithms). 49
Causal Model Guided Experimental Minimization and Adaptive Data Collection The ODLP Algorithm Phase I: iTIE* 50
Causal Model Guided Experimental Minimization and Adaptive Data Collection The ODLP Algorithm Phase II: Adaptively recommend experiments to perform, integrate experimental results to refine and orient the local causal pathway. (i.e. Identify Causes, Effects, and Passengers). 51
Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP: Identifying Effects Manipulate T and obtain experimental data D_E. Mark all variables in V that change in D_E due to manipulation of T as effects. 52
Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP: Direct and Indirect Effects Select an effect variable X that has been marked neither as an indirect effect nor as a direct effect. Manipulate X and obtain experimental data D_E. Mark all effect variables that change in D_E due to manipulation of X and belong to the same equivalence cluster as indirect effects. The last effect variable in an equivalence cluster that is not marked as an indirect effect is a direct effect. 53
Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP: Identifying Passengers Select an unmarked variable X from an equivalence cluster. Manipulate X and obtain experimental data D_E. If T does not change in D_E due to manipulation of X, mark X as a passenger and mark all other non-effect variables that change in D_E due to manipulation of X as passengers; otherwise mark X as a cause. 54
Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP: Identifying Causes For every cause X, mark X as a direct cause if there exists no other cause in the same equivalence cluster that changes due to manipulation of X; otherwise mark X as an indirect cause. If there is an equivalence cluster that contains a single unmarked variable X and all marked variables in this cluster (if any) are only passengers and/or effects, then mark X as a direct cause. 55
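The effect, passenger, and cause marking steps on the preceding slides can be illustrated with a deliberately simplified sketch. It is not the ODLP implementation: equivalence clusters and the direct/indirect distinction are omitted, and experiments are simulated by an oracle that reports the descendants of the manipulated variable in a known toy DAG. All names and the toy graph are hypothetical.

```python
def descendants(children, node):
    """Simulated experiment: variables that change when `node` is manipulated."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def mark_roles(candidates, target, manipulate):
    """Label each candidate as effect, cause, or passenger of the target."""
    roles = {}
    experiments = [target]

    # Manipulate the target: everything that changes is an effect.
    changed = manipulate(target)
    for v in candidates:
        if v in changed:
            roles[v] = "effect"

    # Manipulate each still-unmarked variable in turn.
    for x in candidates:
        if x in roles:
            continue
        experiments.append(x)
        changed = manipulate(x)
        if target in changed:
            roles[x] = "cause"
        else:
            roles[x] = "passenger"
            # Unmarked variables moved by a passenger are passengers too,
            # so they need no experiment of their own.
            for v in candidates:
                if v in changed and v not in roles:
                    roles[v] = "passenger"
    return roles, experiments


# Hypothetical toy DAG: C -> T -> E, C -> P -> Q (each node maps to its children).
children = {"C": ["T", "P"], "T": ["E"], "P": ["Q"]}
roles, experiments = mark_roles(
    ["C", "P", "Q", "E"], "T", lambda x: descendants(children, x)
)
```

In this toy run three experiments (on T, C, and P) label all four candidates, which is the kind of saving in experiments that the adaptive recommendation step aims for.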
Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP vs Other Algorithms: Performance on Simulated Data Benchmark study: 58 algorithms/variants from 4 algorithm families; 11 networks of different sizes. Statnikov et al., 2015 (Accepted) 56
Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP vs Other Algorithms: Network Reconstruction Quality 57
Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP vs Other Algorithms: Reconstruction Quality & Efficiency 58
Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP vs Other Algorithms: Scalability 59
Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP vs Other Algorithms: Performance on Real Biological Data Ma et al., 2015 (submitted) 60
Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP vs Other Algorithms: Performance on Real Biological Data 61
Summary Predictive Modeling o Brief Introduction to Predictive Modeling o Indicative Case Studies Causal Modeling o Causal Modeling using Observational Data o Indicative Case Studies o Causal Modeling-Guided Experimental Minimization and Adaptive Data Collection 62
Future Directions Improve existing algorithms (e.g., relax some application assumptions). Design and implement analysis pipelines that can be used by non-experts. Disseminate software and analytics packages. Apply these techniques broadly in different domains. Educate researchers about the capabilities (and limitations) as well as the proper use of these and related methods. 63