Nima Hejazi. Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi. nimahejazi.org github/nhejazi

Similar documents
Data-Adaptive Estimation and Inference in the Analysis of Differential Methylation

Analysis of a Proposed Universal Fingerprint Microarray

Gene Expression Data Analysis

Measuring methylation: from arrays to sequencing

Lab 1: A review of linear models

Identification of biological themes in microarray data from a mouse heart development time series using GeneSifter

Multiple Testing in RNA-Seq experiments

Exploration and Analysis of DNA Microarray Data

Heterogeneity of Variance in Gene Expression Microarray Data

Introduction to gene expression microarray data analysis

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis

Introduction to Genome Wide Association Studies 2014 Sydney Brenner Institute for Molecular Bioscience/Wits Bioinformatics Shaun Aron

Introduction to Quantitative Genomics / Genetics

Single-cell sequencing

From reads to results: differential. Alicia Oshlack Head of Bioinformatics

Outline. Analysis of Microarray Data. Most important design question. General experimental issues

advanced analysis of gene expression microarray data aidong zhang World Scientific State University of New York at Buffalo, USA

Seven Keys to Successful Microarray Data Analysis

Microarray Data Analysis Workshop. Preprocessing and normalization A trailer show of the rest of the microarray world.

Inferring Gene-Gene Interactions and Functional Modules Beyond Standard Models

Mouse expression data were normalized using the robust multiarray algorithm (1) using

Supplementary Methods

Introduction to microarrays

Analysis of Microarray Data

Workshop on Data Science in Biomedicine

Syllabus for BIOS 101, SPRING 2013

MulCom: a Multiple Comparison statistical test for microarray data in Bioconductor.

Learning theory: SLT what is it? Parametric statistics small number of parameters appropriate to small amounts of data

Microarray Informatics

Data Mining for Biological Data Analysis

Microarray Analysis of Gene Expression in Huntington's Disease Peripheral Blood - a Platform Comparison. CodeLink compatible

BIOSTATISTICS AND MEDICAL INFORMATICS (B M I)

STATISTICAL CHALLENGES IN GENE DISCOVERY

Study on the Application of Data Mining in Bioinformatics. Mingyang Yuan

latestdevelopments relevant for the Ag sector André Eggen Agriculture Segment Manager, Europe

Simultaneous profiling of transcriptome and DNA methylome from a single cell

Computational Challenges of Medical Genomics

Standard Data Analysis Report Agilent Gene Expression Service

Selecting a Sample Size

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Using 2-way ANOVA to dissect the immune response to hookworm infection in mouse lung

Analysis of Microarray Data

GCTA/GREML. Rebecca Johnson. March 30th, 2017

Statistical Applications in Genetics and Molecular Biology

An Analytical Upper Bound on the Minimum Number of. Recombinations in the History of SNP Sequences in Populations

Multiple tests of association with biological annotation metadata arxiv: v1 [stat.ap] 20 May 2008

Gene Expression Data Analysis (I)

THE HEALTH AND RETIREMENT STUDY: GENETIC DATA UPDATE

QTL mapping in mice. Karl W Broman. Department of Biostatistics Johns Hopkins University Baltimore, Maryland, USA.

The application of hidden markov model in building genetic regulatory network

Measuring and Understanding Gene Expression

Some Principles for the Design and Analysis of Experiments using Gene Expression Arrays and Other High-Throughput Assay Methods

Department of Economics, University of Michigan, Ann Arbor, MI

EECS730: Introduction to Bioinformatics

Feature selection methods for SVM classification of microarray data

Introduction to Genome Wide Association Studies 2015 Sydney Brenner Institute for Molecular Bioscience Shaun Aron

Introduction to Bioinformatics

Designing Complex Omics Experiments

Total genomic solutions for biobanks. Maximizing the value of your specimens.

Functional genomics + Data mining

Designing a Complex-Omics Experiments. Xiangqin Cui. Section on Statistical Genetics Department of Biostatistics University of Alabama at Birmingham

Scoring pathway activity from gene expression data

The MiniSeq System. Explore the possibilities. Discover demonstrated NGS workflows for molecular biology applications.

Normalization of metabolomics data using multiple internal standards

Our website:

Sylvane Desrivières, PhD

Introduction. CS482/682 Computational Techniques in Biological Sequence Analysis

elin grundberg JULY 2018 JK - PRESENTATION- 11 MINS [MIXED RESPONDENTS] [Other comments:]

QTL mapping in mice. Karl W Broman. Department of Biostatistics Johns Hopkins University Baltimore, Maryland, USA.

Optimal alpha reduces error rates in gene expression studies: a meta-analysis approach

Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk

Transcriptome analysis

Functional normalization of 450k methylation array data improves replication in large cancer studies

Today. Last time. Lecture 5: Discrimination (cont) Jane Fridlyand. Oct 13, 2005

From Machine Learning to Personalized Medicine

Introduction to ChIP Seq data analyses. Acknowledgement: slides taken from Dr. H

Some Principles for the Design and Analysis of Experiments using Gene Expression Arrays and Other High-Throughput Assay Methods

Indicator Development for Forest Governance. IFRI Presentation

CS-E5870 High-Throughput Bioinformatics Microarray data analysis

Detection and analysis of CpG sites with multimodal DNA methylation level distributions and their relationships with SNPs

Comparison of Microarray Pre-Processing Methods

Introduction to Bioinformatics

Bioinformatics : Gene Expression Data Analysis

Methylation Analysis of Next Generation Sequencing

Some Principles for the Design and Analysis of Experiments using Gene Expression Arrays and Other High-Throughput Assay Methods

Case Study: Dr. Jonny Wray, Head of Discovery Informatics at e-therapeutics PLC

Microarray Gene Expression Analysis at CNIO

Microarray Informatics

Genomic solutions for complex disease

Additional file 2. Figure 1: Receiver operating characteristic (ROC) curve using the top

Alexander Statnikov, Ph.D.

Bioinformatics for Biologists

RNA

Analysis of Factors Affecting Resignations of University Employees

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs

Dose-Response Modeling of Gene Expression Data in Microarray Experiments

Using 2-way ANOVA to dissect gene expression following myocardial infarction in mice

Statistical Methods in Bioinformatics

Top 5 Lessons Learned From MAQC III/SEQC

Preprocessing of Microarray data I: Normalization and Missing Values. For example, a list of possible sources in spotted arrays:

Transcription:

Data-Adaptive Estimation and Inference in the Analysis of Differential Methylation for the annual retreat of the Center for Computational Biology, given 18 November 2017 Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi slides: goo.gl/xabp3q This slide deck is for a brief (about 15-minute) talk on a new statistical algorithm for using nonparametric and data-adaptive estimates of variable importance measures for differential methylation analysis. This talk was most recently given at the annual retreat of the Center for Computational Biology at the University of California, Berkeley. Source: https://github.com/nhejazi/talk_methyvim Slides: https://goo.gl/jdhseg With notes: https://goo.gl/xabp3q

Preview: Summary DNA methylation data is extremely high-dimensional we can collect data on 850K genomic sites with modern arrays! Normalization and QC are critical components of properly analyzing modern DNA methylation data. There are many choices of technique. A relative scarcity of techniques for estimation and inference exists analyses are often limited to the general linear model. Statistical causal inference provides an avenue for answering richer scientific questions, especially when combined with modern advances in machine learning. 1 We ll go over this summary again at the end of the talk. Hopefully, it will all make more sense then.

Motivation: Let s meet the data Observational study of the impact of disease state on DNA methylation. Phenotype-level quantities: 216 subjects, binary disease status (FASD) of each subject, background info on subjects (e.g., sex, age). Genomic-level quantities: 850, 000 CpG sites interrogated using the Infinium MethylationEPIC BeadChip by Illumina. Questions: How do disease status and differential methylation relate? Is a coherent biomarker-type signature detectable? 2 FASD is an abbreviation for Fetal Alcohol Spectrum Disorders. We re mostly interested in the interplay between disease and DNA methylation. In particular, we d like to construct some kind of importance score for CpG sites impacted by the exposure/disease of interest. Re: dimensionality, c.f., RNA-seq analyses are 30, 000 in dimension at the gene level.

DNA Methylation Figure: https://www.illumina.com/techniques/sequencing/ methylation-sequencing.html (source) 3 The biology of these structures is quite complicated: many different structures CpG islands, shores, etc. Both hypermethylation and hypomethylation are linked to disease states. It s especially important to examine these processes, keeping in mind that the process may be complex and non-monotonic.

Data analysis? Linear Models! Standard operating procedure: For each CpG site (g = 1,..., G), fit a linear model: E[y g ] = Xβ g Test the coefficent of interest using a standard t-test: t g = ˆβ g β g,h0 s g Such models are a matter of convenience: does ˆβ g answer our scientific questions? Perhaps not. Is consideration being given to whether the data could have been generated by a linear model? Perhaps not. 4 CpG sites are thought to function in networks. Treating them as acting independently is not faithful to the underlying biology. The linear model is a great starting point for analyses whne the data is generated using complex technology no need to make the analysis more complicated. That being said, the data is difficult and expensive to collect, so why restrict the scope of the questions we d like to ask.

Motivation: Science Before Statistics What is the effect of disease status on DNA methylation at a specific CpG site, controlling for the observed methylation status of the neighbors of the given CpG site? 5 Again, CpG sites are thought to function in networks. Treating them as acting independently is not faithful to the underlying biology. This means that we should take into account the methylation status of neighboring CpG sites when assessing differential methylation at a single site. This is a coherent scientific question that we can set out to answer statistically. It s motivated by the established science and possible to do with modern statistical methodology.

Data analysis? A Data-Adaptive Approach 1. Isolate a subset of CpG sites for which there is cursory evidence of differential methylation. 2. Assign CpG sites into neighborhoods (e.g., bp distance). If there are many neighbors, apply clustering (e.g., PAM) to select a subset. 3. Estimate variable importance measure (VIM) at each screened CpG site, with disease as intervention (A) and controlling for neighboring CpG sites (W). 4. Apply a variant of the Benjamini & Hochberg method for FDR control, accounting for initial screening. 6 Pre-screening is a critical step since we cannot perform computationally intensive estimation on all the sites. This is flexible just use your favorite method (as long as allows a ranking to be made). The variable importance step merely comes down to the creation of a score. We use TMLE to statistically estimate parameters from causal models. The procedure is general enough to accomodate any inference technique.

Pre-Screening Pick Your Favorite Method The estimation procedure is computationally intensive apply it only to sites that appear promising. Consider estimating univariate (linear) regressions of intervention on CpG methylation status. Fast, easy. Select CpG sites with a marginal p-value below, say, 0.01. Apply data-adaptive procedure to this subset. The modeling assumptions do not matter since the we won t be pursuing inference under such a model. Software implementation is extensible. Users are encouraged to add their own. (It s easy!) 7 We ll be adding to the available routines for pre-screening too! For now, we have limma, and more are on the way.

Too Many Neighbors? Clustering There are many options: k-means, k-medoids, etc., as well as many algorithmic solutions. For convenience, we use Partitioning Around Medoids (PAM), a well-established algorithm. With limited sample sizes, the number of neighboring sites that may be controlled for is limited. To faithfully answer the question of interest, choose the neighboring sites that are the most representative. This is an optional step it need only be applied when there is a large number of CpG sites in the neighborhood of the target CpG site. 8 The number of sites that we can control for is roughly a function of sample size. This impacts the definition of the parameter that we estimate, and allows enough flexibility to obtain either very local or more regional estimates.

Nonparametric Variable Importance Let s consider a simple target parameter: the average treatment effect (ATE): Ψ g (P 0 ) = E W,0 [E 0 [Y g A = 1, W g ] E 0 [Y g A = 0, W g ]] Under certain (untestable) assumptions, interpretable as difference in methylation at site g with intervention and, possibly contrary to fact, the same under no intervention, controlling for neighboring sites. Provides a nonparametric (model-free) measure for those CpG sites impacted by a discrete intervention. Let the choice of parameter be determined by our scientific question of interest. 9 By allowing scientific questions to inform the parameters that we choose to estimate, we can do a better job of actually answering the questions of interest to our collaborators. Further, we abandon the need to specify the functional relationship between our outcome and covariates; moreover, we can now make use of advances in machine learning.

Target Minimum Loss-Based Estimation We use targeted minimum loss-based estimation (TMLE), a method for inference in semiparametric infinite-dimensional statistical models. No need to specify a functional form or assume that we know the true data-generating distribution. Asymptotic linearity: Ψ g (P n) Ψ g (P 0 ) = 1 n n i=1 IC(O i ) + o P ( 1 n ) Limiting distribution: n(ψn Ψ) N(0, Var(D(P 0 ))) Statistical inference: Ψ n ± z α σn n 10 Under the additional ( ) condition that the remainder term R(ˆP, P 0 ) decays as o P 1 n, we have that ( ) Ψ n Ψ 0 = (P n P 0 ) D(P 0 ) + o P 1 n, which, by a central limit theorem, establishes a Gaussian limiting distribution for the estimator, with variance V(D(P 0 )), the variance of the efficient influence curve (canonical gradient) when Ψ admits an asymptotically linear representation. The above implies that Ψ n is a n-consistent estimator of Ψ, that it is asymptotically normal (as given above), and that it is locally efficient. This allows us to build Wald-type confidence intervals, where σ 2 n is an estimator of V(D(P 0 )). The estimator σ 2 n may be obtained using the bootstrap or computed directly via σ 2 n = 1 n n i=1 D2 ( Q n, g n )(O i )

Corrections for Multiple Testing Multiple testing corrections are critical. Without these, we systematically obtain misleading results. The Benjamini & Hochberg procedure for controlling the False Discovery Rate (FDR) is a well-established technique for addressing the multiple testing issue. We use a modified BH-FDR procedure to account for the pre-screening step of the proposed algorithm. This modified BH-FDR procedure for multi-stage analyses (FDR-MSA) works by adding a p-value of 1.0 for each site that did not pass pre-screening then performs BH-FDR as normal. 11 Note that FDR = E [ V R] = E [ V R R > 0] P(R > 0). BH-FDR procedure: Find ˆk = max{k : p (k) k M α} FDR-MSA will only incur a loss of power if the initial screening excludes variables that would have been rejected by the BH procedure when applied to the subset on which estimation was performed. BH-FDR control is a rank-based procedure, so we must assume that the pre-screening does not disrupt the ranking with respect to the estimation subset, which is provably true for screening procedures of a given type. MSA controls type I error with any procedure that is a function of only the type I error itself e.g., FWER. This does not hold for the FDR in complete generality.

Software package: R/methyvim Figure: https://bioconductor.org/packages/methyvim Variable importance for discrete interventions. Future releases will support continuous interventions. Take it for a test drive! 12 Contribute on GitHub. Reach out to us with questions and feature requests.

Data analysis the methyvim way Figure: http://code.nimahejazi.org/methyvim 13

Review: Summary DNA methylation data is extremely high-dimensional we can collect data on 850K genomic sites with modern arrays! Normalization and QC are critical components of properly analyzing modern DNA methylation data. There are many choices of technique. A relative scarcity of techniques for estimation and inference exists analyses are often limited to the general linear model. Statistical causal inference provides an avenue for answering richer scientific questions, especially when combined with modern advances in machine learning. 14 It s always good to include a summary.

References I Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological), pages 289 300. Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press. Smyth, G. K. (2004). Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1):1 25. Smyth, G. K. (2005). LIMMA: linear models for microarray data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor, pages 397 420. Springer Science & Business Media. 15 References II Tuglus, C. and van der Laan, M. J. (2009). Modified FDR controlling procedure for multi-stage analyses. Statistical Applications in Genetics and Molecular Biology, 8(1):1 15. van der Laan, M. J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media. van der Laan, M. J. and Rose, S. (2017). Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. Springer Science & Business Media. 16

Acknowledgments Collaborators: Mark van der Laan Alan Hubbard Lab of Martyn Smith Lab of Nina Holland Rachael Phillips University of California, Berkeley Funding source: National Library of Medicine (of NIH): T32LM012417 17 Thank you. Slides: goo.gl/jdhseg Notes: goo.gl/xabp3q Source (repo): goo.gl/m5as73 stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi 18