Predictive and Causal Modeling in the Health Sciences. Sisi Ma MS, MS, PhD. New York University, Center for Health Informatics and Bioinformatics

Exponentially Rapid Data Accumulation. Timeline: 1975, rapid DNA sequencing; 1982, GenBank formed; 1986, protein sequencing via MS; 1990, Human Genome Project initiated; 2003, completion of human genome sequencing; PDB initiated; 2005, first GWAS study published, NGS; 2006, TCGA initiated; 1,000 Genomes initiated; 2010, Human Connectome Project; 2012, single-cell sequencing; 2016, TCGA completed (>10,000 tumors). 2

From Data to Discoveries. Advanced data preparation, analysis, and modeling methods are needed for knowledge discovery in high-volume, high-variety data. Two key types: predictive modeling and computational causal discovery. A predictive model yields predictive knowledge (screening, diagnostics, prognostics); a causal model yields causal knowledge (intervention, therapeutics). 3

Talk Outline. Predictive Modeling: brief introduction to predictive modeling; indicative case studies. Causal Modeling: causal modeling using observational data; indicative case studies; causal-modeling-guided experimental minimization and adaptive data collection. 4

Predictive Models: the Goal 6

Example of Predictive Modeling: Support Vector Machines (SVMs). Key characteristics of SVMs: a maximum gap (margin) between the classes to help prevent overfitting; training is a quadratic programming (QP) problem that can be solved with standard methods; soft margins to tolerate noise; the kernel trick for data that are not linearly separable. Boser et al., 1992; Statnikov et al., 2011. (Figure: Support Vector Machine.) 7
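The characteristics listed on this slide can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the talk: it evaluates the standard soft-margin objective (half the squared norm of the weights plus C times the summed hinge losses), the width of the gap between the margin hyperplanes, and an RBF kernel as one common choice for the kernel trick. The toy data, the weight vector, and all function names are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses; labels in y are +1 or -1."""
    hinge = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(hinge)

def margin_width(w):
    """Distance between the two margin hyperplanes: 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel, usable in place of the plain dot product."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

# Hypothetical toy data: four clean points and one noisy negative point
# that falls on the wrong side of the separating line.
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0), (0.5, 0.0)]
y = [1, 1, -1, -1, -1]
w, b = (1.0, 0.0), 0.0

objective = soft_margin_objective(w, b, X, y, C=1.0)
```

A larger C penalizes the noisy point more heavily, which is the sense in which the soft margin trades gap width against tolerance of noise.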

Predictive Models: the Goal 8

Predictive Modeling: a Simplified General Framework 9

Predictive Modeling: Cross-validation for performance estimation and model selection. Ma et al., 2015 (in preparation). 10

Predictive Modeling for Post-traumatic Stress. Post-traumatic stress response: Almost everyone experiences at least one traumatic event in their life. Most people display acute stress responses. Acute stress responses diminish over time in most individuals, but about 10%-20% of people experience non-remitting stress responses long after the trauma. Persistent stress is detrimental to the physiological and psychological well-being of individuals. Galatzer-Levy et al., 2015; Ma et al., 2015; Galatzer-Levy et al., 2015 (submitted). 12

Predictive Modeling for Post-traumatic Stress. Discovery Goals/Questions: Can we identify the people who will suffer from non-remitting stress responses? If so, can they be identified early enough? What types of data need to be collected to identify people who will suffer from non-remitting stress responses? 13

Predictive Modeling for Post-traumatic Stress. Data: 166 trauma survivors who were admitted to the ER were followed for up to 4 months after the trauma. Patient history, clinical data, stress hormones, and psychiatric measurements were collected in the ER and at 1 week, 1 month, and 4 months after the trauma. A total of 135 variables were collected. 14

Predictive Modeling for Post-traumatic Stress Remitting and Non-remitting Post-traumatic Stress Responses (Identified via Latent Growth Mixture Modeling) 15

Predictive Model for Post-traumatic Stress. Study Design: Five predictive models were built using data incorporating increasing amounts of information: (1) background data; (2) data collected through the ER; (3) data collected through 1 week; (4) data collected through 1 month; (5) data collected through 4 months. An SVM with feature selection was employed, with 10-split, 5-fold cross-validation. 17
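One way to read "10-split, 5-fold cross-validation" is 5-fold cross-validation repeated over 10 random partitions of the 166 subjects. The sketch below generates such splits under that reading; it is an assumption about the protocol, not the authors' code. It is unstratified for brevity, the function name and seed are hypothetical, and no classifier is fitted. A common precaution, assumed here rather than stated in the slides, is to redo feature selection inside each training fold so the test fold never influences which features are chosen.

```python
import random

def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                      # a fresh random partition per repeat
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f, test in enumerate(folds):
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, sorted(train), sorted(test)

# 166 subjects, as in the study described above.
splits = list(repeated_kfold(166))

# Feature selection and SVM fitting would go inside this loop, using
# train_idx only, followed by evaluation on test_idx (not shown).
fold_sizes = sorted({len(test_idx) for _, _, _, test_idx in splits})
```

Averaging the performance estimate over all 50 train/test pairs reduces the variance that comes from any single random partition of a small cohort.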

Predictive Modeling for Post-traumatic Stress. Prediction accuracy increases progressively as data collected at later time points are added to the predictive models. The predictivity of the model built with patient background information is statistically significant. Models built with patient background information and data collected in the ER have strong enough predictive performance to be clinically useful. 18

Predictive Modeling for Post-traumatic Stress. Discovery Goals/Questions: Can we identify the people who will suffer from non-remitting stress responses? If so, can they be identified early enough? What types of data need to be collected to identify people who will suffer from non-remitting stress responses? Specifically, can neuroendocrine levels predict non-remitting post-traumatic stress? 19

Predictive Modeling for Post-traumatic Stress. The neuroendocrine data studied contain limited information about non-remitting stress responses. Except at the time of the ER visit, combining neuroendocrine data with other data (clinical information, psychiatric surveys) does not significantly increase the predictivity of the models. 20

Other Case Studies for Predictive Modeling: Predicting Cancer Patient Outcome; Predicting Neural Activity in the Dorsolateral Striatum; Predicting Transposon Insertion. 21

Other Case Studies for Predictive Modeling: Predicting Cancer Patient Outcome. Problem: Determine the most informative data modality for predicting cancer patient outcome. Data: 47 datasets/predictive tasks that together span 9 data modalities: copy number, gene expression, protein expression, microRNA expression, imaging, GWAS, somatic mutation, methylation, and clinical information. Conclusion: Gene expression is, in general, the most informative data modality. Combining different data modalities does not increase predictive performance. Ray MS, Henaff MS, Aliferis PhD, Statnikov PhD @ NYU. Ray et al., 2014. 22

Other Case Studies for Predictive Modeling: Predicting Neural Activity in the Dorsolateral Striatum (DLS). Problem: Predict neural activity from movement data. Data: single-neuron activity in the DLS; head movement tracking data. Model: a Linear-Nonlinear-Poisson (LNP) model that predicts neural activity from the head movement profile of the animal and the spike history of the neuron. Neural activity was reconstructed in a subpopulation of the neurons. David Barker PhD @ NIDA. Ma and Barker, 2014. 23
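For readers unfamiliar with the model class, a Linear-Nonlinear-Poisson model passes a linear combination of its inputs through a nonlinearity to obtain a firing rate, and treats the observed spike count as Poisson with that rate. The sketch below assumes an exponential nonlinearity and hypothetical filter names (`k` for the movement features, `h` for the spike history); the slides do not say which nonlinearity or features were used in the study.

```python
import math

def lnp_rate(movement, spike_history, k, h, bias):
    """Expected spike count per bin: exp(bias + k . movement + h . history).

    The exponential nonlinearity is an assumption of this sketch.
    """
    drive = bias
    drive += sum(ki * mi for ki, mi in zip(k, movement))
    drive += sum(hi * ni for hi, ni in zip(h, spike_history))
    return math.exp(drive)

def poisson_log_likelihood(count, rate):
    """log P(count | rate) under a Poisson distribution."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)

# Hypothetical example: two movement features and one bin of spike history.
rate = lnp_rate(movement=[0.2, -0.1], spike_history=[1],
                k=[1.0, 2.0], h=[-0.5], bias=0.0)
ll = poisson_log_likelihood(count=1, rate=rate)
```

Fitting such a model typically means choosing `k`, `h`, and `bias` to maximize the summed log-likelihood over time bins; the optimization step is omitted here.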

Other Case Studies for Predictive Modeling: Predicting Transposon Insertion. Problem: Identify transposon insertion locations in the genome. Data: targeted sequencing data. Model: Train a logistic regression model on a set of annotated transposon insertion sites and apply the model for de novo insertion identification. More than 95% of the de novo insertions identified by the model were validated by experiments. Zuojian Tang MS, David Fenyo PhD, Jeff Boeke PhD @ NYU Langone, Kathleen Burns @ JHU. 24
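The train-then-apply pattern described on this slide can be sketched with a minimal logistic regression fitted by gradient descent. Everything specific below is hypothetical: the single numeric feature per candidate site, the toy labels, and the learning-rate settings are illustrative stand-ins, since the slides do not describe the features derived from the sequencing data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, n_steps=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_proba(w, b, x):
    """Probability that a candidate site is a true insertion."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Hypothetical annotated sites: one feature each, label 1 = known insertion.
X_train = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
y_train = [0, 0, 1, 1]
w, b = fit_logistic(X_train, y_train)

# Apply the trained model to a new, unannotated candidate site.
p_new = predict_proba(w, b, (1.5,))
```

Candidates scoring above a chosen probability threshold would then be passed on for experimental validation.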

Causal Modeling: the Goal 26

Causal Modeling: the Goal 27

Causal Modeling: Causal Graphs Capture Direct and Indirect Relationships 28

Causal Modeling: V-structures, a Common Technique for Orienting Causal Relationships 29

Causal Modeling: PC Algorithm, a prototypical causal discovery algorithm. PC algorithm: Skeleton Discovery. Spirtes et al., 1993. 30

Causal Modeling: PC Algorithm. PC algorithm: Skeleton Discovery, Trace 31
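Since the skeleton-discovery slides were figures, the textbook form of that phase is sketched here for reference. It starts from a fully connected undirected graph and removes the edge between two variables whenever they test as independent given some subset of one endpoint's current neighbours, trying larger conditioning sets in turn. The `independent` argument is a hypothetical stand-in for a statistical conditional independence test; in this example it is an oracle for the chain A -> B -> C. This is a sketch of the general scheme, not the implementation used in the talk.

```python
from itertools import combinations

def pc_skeleton(variables, independent):
    """Return (edges, sepsets) from the skeleton phase of the PC algorithm.

    independent(x, y, cond) is a caller-supplied conditional independence
    test (hypothetical here); cond is a tuple of conditioning variables.
    """
    adj = {v: set(variables) - {v} for v in variables}
    sepsets = {}
    size = 0
    # Grow the conditioning-set size until no node has enough neighbours.
    while any(len(adj[x]) - 1 >= size for x in variables):
        for x in variables:
            for y in sorted(adj[x]):
                # Condition on subsets of x's other current neighbours.
                for cond in combinations(sorted(adj[x] - {y}), size):
                    if independent(x, y, cond):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepsets[frozenset((x, y))] = set(cond)
                        break
        size += 1
    edges = {frozenset((x, y)) for x in variables for y in adj[x]}
    return edges, sepsets

# Oracle for the chain A -> B -> C: A and C are independent given B.
def chain_oracle(x, y, cond):
    return {x, y} == {"A", "C"} and "B" in cond

edges, sepsets = pc_skeleton(["A", "B", "C"], chain_oracle)
```

The recorded separating sets are what the later orientation phase uses to detect v-structures.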

Causal Modeling: PC Algorithm. PC algorithm: Orientation 32
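The v-structure rule used during orientation can be stated compactly: for an unshielded triple X - Z - Y, where X and Y are not adjacent, orient X -> Z <- Y whenever Z is not in the set that separated X from Y. The sketch below applies only this rule to a given skeleton and separating sets; the further propagation rules of the PC algorithm are omitted, and the input format (frozenset edges, a dict of separating sets) is an assumption of this example.

```python
from itertools import combinations

def orient_v_structures(edges, sepsets):
    """Return directed edges (tail, head) implied by unshielded colliders."""
    nodes = sorted({v for e in edges for v in e})
    adj = {v: {u for e in edges if v in e for u in e if u != v} for v in nodes}
    directed = set()
    for z in nodes:
        for x, y in combinations(sorted(adj[z]), 2):
            if y in adj[x]:
                continue  # shielded triple: x and y are adjacent
            if z not in sepsets.get(frozenset((x, y)), set()):
                directed.add((x, z))
                directed.add((y, z))
    return directed

# Skeleton A - C - B where A and B were separated by the empty set:
# C is a collider, so both edges point into C.
collider = orient_v_structures(
    edges={frozenset(("A", "C")), frozenset(("B", "C"))},
    sepsets={frozenset(("A", "B")): set()},
)

# Skeleton A - B - C where A and C were separated by {B}:
# no v-structure, so nothing is oriented by this rule.
chain = orient_v_structures(
    edges={frozenset(("A", "B")), frozenset(("B", "C"))},
    sepsets={frozenset(("A", "C")): {"B"}},
)
```

The chain case shows why observational data alone often leaves some edges unoriented, which motivates the experimental methods later in the talk.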

Causal Modeling: HITON-PC Algorithm. (Figure: example graph with nodes A, B, C, D, E, and T.) A local causal discovery method. Easily extended for global causal discovery with the LGL framework. Aliferis et al., 2010. 33

Causal Modeling: HITON-PC Algorithm. Trace of HITON-PC. (Figure: example graph with nodes A, B, C, D, E, and T.) 34

Causal Modeling: Semi-Interleaved HITON-PC, a more efficient implementation. Efficient and robust. Scalable to very big data. Easily extended for global causal discovery with the LGL framework. An instantiation of the GLL framework. 35
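As a rough guide to what the semi-interleaved variant does, the sketch below follows the general scheme as commonly described: candidates enter a tentative parents-and-children set in order of decreasing association with the target, each newcomer is dropped if some subset of the current set renders it independent of the target, and a final pass re-checks every remaining member. This is a simplified reading, not a faithful reimplementation; Aliferis et al., 2010 remains the authoritative description. The `assoc` and `independent` callables, the `max_k` cap, and the example graph D -> A -> T -> C are all hypothetical.

```python
from itertools import combinations

def semi_interleaved_hiton_pc(target, variables, assoc, independent, max_k=3):
    """Return a tentative parents-and-children set for `target`."""
    open_list = sorted(
        (v for v in variables if v != target and assoc(v, target) > 0),
        key=lambda v: assoc(v, target), reverse=True)
    tpc = []

    def separable(x):
        # True if some subset of the other members makes x independent of target.
        others = [v for v in tpc if v != x]
        return any(independent(x, target, cond)
                   for k in range(min(max_k, len(others)) + 1)
                   for cond in combinations(others, k))

    for x in open_list:          # inclusion, checking only the newcomer
        tpc.append(x)
        if separable(x):
            tpc.remove(x)
    for x in list(tpc):          # final elimination pass over all members
        if separable(x):
            tpc.remove(x)
    return set(tpc)

# Oracle for D -> A -> T -> C: D is independent of T given A.
strength = {"A": 0.9, "C": 0.8, "D": 0.5, "N": 0.0}
pc_set = semi_interleaved_hiton_pc(
    target="T",
    variables=["A", "C", "D", "N", "T"],
    assoc=lambda v, t: strength[v],
    independent=lambda x, t, cond: x == "D" and "A" in cond,
)
```

Checking only the newcomer during inclusion is what saves tests relative to a fully interleaved strategy, at the price of needing the final pass.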

Causal Modeling for Post-traumatic Stress Study. Data: 166 trauma survivors who were admitted to the ER were followed for up to 4 months after the trauma. Patient history, clinical data, stress hormones, and psychiatric measurements were collected in the ER and at 1 week, 1 month, and 4 months after the trauma. A total of 135 variables were collected. Galatzer-Levy et al., 2015; Ma et al., 2015; Galatzer-Levy et al., 2015 (submitted). 37

Causal Model for Post-traumatic Stress. Causal Discovery Question: What are the factors determining non-remitting stress responses? Analysis Design: Apply a local causal discovery algorithm (HITON-PC) to find the parents-and-children sets of all measured variables. A global causal graph depicting the relationships among all measured variables was constructed using the local-to-global framework LGL. Edges were oriented according to the time at which the individual variables were measured. 38
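The time-based orientation step rests on the fact that a later measurement cannot cause an earlier one. A minimal sketch of that rule follows; the variable names and the day offsets are hypothetical placeholders, not variables from the study, and edges between variables measured at the same time are simply left undirected here.

```python
def orient_by_time(edges, measured_at):
    """Orient each undirected edge from the earlier to the later variable.

    edges: iterable of (x, y) pairs; measured_at: {variable: time}.
    Returns (directed, undirected), where undirected holds same-time pairs.
    """
    directed, undirected = set(), set()
    for x, y in edges:
        if measured_at[x] < measured_at[y]:
            directed.add((x, y))
        elif measured_at[y] < measured_at[x]:
            directed.add((y, x))
        else:
            undirected.add(frozenset((x, y)))
    return directed, undirected

# Hypothetical variables, with measurement time in days after the trauma.
measured_at = {"er_var": 0, "er_var_2": 0, "week1_var": 7, "month4_outcome": 120}
edges = [("week1_var", "er_var"),
         ("week1_var", "month4_outcome"),
         ("er_var", "er_var_2")]
directed, undirected = orient_by_time(edges, measured_at)
```

Same-time pairs would need another source of orientation, such as v-structures or experiments.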

Causal Modeling for Post-traumatic Stress. The Global Causal Graph: a very complicated model! 39

Causal Modeling for Post-traumatic Stress Example Causal Path Leading to non-remitting Stress Responses 40

Causal Modeling for Post-traumatic Stress Potential intervention for non-remitting Stress Responses 41

Causal Modeling for Post-traumatic Stress Potential Intervention for non-remitting Stress Responses 42

Causal Model Guided Experimental Minimization and Adaptive Data Collection. Goals: Reduce the number of experiments that experimentalists need to perform in order to fully resolve a biological pathway (or other complex set of causal interactions among variables of interest). Reduce time to discovery. Reduce costs. 44

Causal Model Guided Experimental Minimization and Adaptive Data Collection. Special importance in the health sciences, with both omics data and clinical data: One variable can be univariately associated with hundreds to thousands of variables, which may be drivers (direct and indirect), passengers, or effects. High degree of multiplicity. Classical statistical techniques exhibit increases in both false positives and false negatives. 45

Causal Model-Guided Experimental Minimization and Adaptive Data Collection Simplified view of the Framework: 46

Causal Model Guided Experimental Minimization and Adaptive Data Collection. The ODLP Algorithm. Output: local causal pathway (parents and children) of the variable of interest. Two phases: (I) identify the local causal pathway consistent with the data, together with information-equivalent clusters; (II) adaptively recommend experiments to perform and integrate the experimental results to refine and orient the local causal pathway. Statnikov et al., 2015 (accepted). 47

Causal Model Guided Experimental Minimization and Adaptive Data Collection. ODLP: Pseudocode. Output: local causal pathway (parents and children) of the variable of interest. Two phases: (I) identify the local causal pathway consistent with the data, together with information-equivalent clusters; (II) adaptively recommend experiments to perform and integrate the experimental results to refine and orient the local causal pathway. 48

Causal Model Guided Experimental Minimization and Adaptive Data Collection. The ODLP Algorithm, Phase I: Identify the local causal pathway consistent with the data, together with information-equivalent clusters (TIE* and iTIE* algorithms). 49

Causal Model Guided Experimental Minimization and Adaptive Data Collection. The ODLP Algorithm, Phase I: iTIE* 50

Causal Model Guided Experimental Minimization and Adaptive Data Collection. The ODLP Algorithm, Phase II: Adaptively recommend experiments to perform and integrate the experimental results to refine and orient the local causal pathway (i.e., identify causes, effects, and passengers). 51

Causal Model Guided Experimental Minimization and Adaptive Data Collection. ODLP: Identifying effects. Manipulate T and obtain experimental data D_E. Mark all variables in V that change in D_E due to manipulation of T as effects. 52

Causal Model Guided Experimental Minimization and Adaptive Data Collection. ODLP: Direct and indirect effects. Select an effect variable X that has been marked neither as an indirect effect nor as a direct effect. Manipulate X and obtain experimental data D_E. Mark all effect variables that change in D_E due to manipulation of X and belong to the same equivalence cluster as indirect effects. The last effect variable in an equivalence cluster that is not marked as an indirect effect is a direct effect. 53

Causal Model Guided Experimental Minimization and Adaptive Data Collection. ODLP: Identifying passengers. Select an unmarked variable X from an equivalence cluster. Manipulate X and obtain experimental data D_E. If T does not change in D_E due to manipulation of X, mark X as a passenger and mark all other non-effect variables that change in D_E due to manipulation of X as passengers; otherwise mark X as a cause. 54

Causal Model Guided Experimental Minimization and Adaptive Data Collection. ODLP: Identifying causes. For every cause X, mark X as a direct cause if there exists no other cause in the same equivalence cluster that changes due to manipulation of X; otherwise mark X as an indirect cause. If there is an equivalence cluster that contains a single unmarked variable X and all marked variables in this cluster (if any) are only passengers and/or effects, then mark X as a direct cause. 55
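The effect, passenger, and cause rules above can be illustrated with a deliberately simplified sketch. It is not the published ODLP algorithm: equivalence clusters and the direct/indirect distinction are omitted, and each experiment is simulated by assuming that manipulating a variable changes exactly its descendants in a hypothetical ground-truth graph (C1 -> C2 -> T -> E1 -> E2, with C1 -> P1 -> P2). All names are assumptions made for illustration.

```python
def descendants(dag, x):
    """All nodes reachable from x in a DAG given as {node: [children]}."""
    seen, stack = set(), list(dag.get(x, []))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(dag.get(v, []))
    return seen

def label_variables(target, candidates, manipulate):
    """Label candidates as effect, cause, or passenger; count experiments."""
    labels = {}
    n_experiments = 1                      # the experiment that manipulates T
    for v in manipulate(target):
        if v in candidates:
            labels[v] = "effect"
    for x in candidates:
        if x in labels:
            continue
        changed = manipulate(x)
        n_experiments += 1
        if target in changed:
            labels[x] = "cause"
        else:
            labels[x] = "passenger"
            # Unmarked variables that changed are passengers too, so they
            # need no experiment of their own.
            for v in changed:
                if v in candidates and v not in labels:
                    labels[v] = "passenger"
    return labels, n_experiments

# Hypothetical ground truth, used only to simulate experimental outcomes.
dag = {"C1": ["C2", "P1"], "C2": ["T"], "T": ["E1"], "E1": ["E2"], "P1": ["P2"]}
candidates = ["C1", "C2", "E1", "E2", "P1", "P2"]
labels, n_experiments = label_variables(
    "T", candidates, manipulate=lambda x: descendants(dag, x))
```

In this toy case four experiments label six variables, because one manipulation of T marks both effects and one manipulation of P1 also resolves P2.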

Causal Model Guided Experimental Minimization and Adaptive Data Collection. ODLP vs Other Algorithms: Performance on Simulated Data. Benchmark study: 58 algorithms/variants from 4 algorithm families; 11 networks of different sizes. Statnikov et al., 2015 (accepted). 56

Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP vs Other Algorithms: Network Reconstruction Quality 57

Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP vs Other Algorithms: Reconstruction Quality & Efficiency 58

Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP vs Other Algorithms: Scalability 59

Causal Model Guided Experimental Minimization and Adaptive Data Collection. ODLP vs Other Algorithms: Performance on Real Biological Data. Ma et al., 2015 (submitted). 60

Causal Model Guided Experimental Minimization and Adaptive Data Collection ODLP vs Other Algorithms: Performance on Real Biological Data 61

Summary. Predictive Modeling: brief introduction to predictive modeling; indicative case studies. Causal Modeling: causal modeling using observational data; indicative case studies; causal-modeling-guided experimental minimization and adaptive data collection. 62

Future directions. Improve existing algorithms (e.g., relax some application assumptions). Design and implement analysis pipelines that can be used by non-experts. Disseminate software and analytics packages. Apply these techniques broadly in different domains. Educate researchers about the capabilities (and limitations) as well as the proper use of these and related methods. 63