Integration of heterogeneous omics data

Andrea Rau (andrea.rau@jouy.inra.fr)
March 11, 2016
Formation doctorale: Biologie expérimentale animale et modélisation prédictive (doctoral training course: experimental animal biology and predictive modeling)

Outline
1. Introduction
   - Integromics
   - Example data: TCGA multi-omics data
2. Descriptive integration with multiple factor analysis
3. Clustering integration with iCluster+
4. Discussion

Integrative data analysis

Integrative omics data analysis ("integromics")
- Public genome databases like NCBI already house petabytes ($10^6$ GB) of data and are growing exponentially each year.
- It is increasingly difficult to extract full value from massive omics data in a unified and meaningful way:
  - Gene expression (RNA-seq, microarrays)
  - Protein expression
  - Methylation
  - Metabolome
  - Copy number variants
  - Genomic mutations
  - Functional annotations
  - Gene pathway membership
  - Protein-protein interactions
  - High-throughput phenotypic information
- Focusing on a single platform runs the risk of missing an obvious signal!

Integromics: a relatively new phenomenon

The broad umbrella of integrative data analysis
Ultimate goal: understanding complex processes.
"Integrative analysis" has many different meanings:
- Exploration
- Description
- Classification (supervised, unsupervised, semi-supervised)
- Variable selection / biomarker identification
- Phenotype prediction
- Meta-analysis
- ...

Integrative multi-omics analysis: What? Why?
1. Exploration
   - Multiple Factor Analysis (MFA)
   - Regularized canonical correlation
   - Multiple co-inertia analysis
2. Classification
   - Clustering (iClusterPlus)
3. Prediction
   - Integrative Lasso with Penalty Factors (IPF-Lasso)
   - Multi-group partial least squares
   - Penalized linear discriminant analysis

... with lots of statistical and practical difficulties!
- Missing or incomplete data
- Potentially heterogeneous quality across datasets
- Need for normalization / standardization / preprocessing (???)
- Many (!!) more variables than observations (ultra-high dimensionality)
- Multiple testing
- Datasets of differing sizes
- Potentially large requirements for data storage and computing power
- ... and of course, biological interpretation!

Introduction to the TCGA data
- Comprehensive and coordinated effort to improve the molecular understanding of major types and sub-types of cancer through high-throughput genomics
- Clinical information + genomic characterization data + high-level sequence analysis of tumor genomes
- 34 cancer types/sub-types
- Open-access tier (public data not unique to individuals) and controlled-access tier (primary sequence data, raw SNP data)

TCGA data (matched/unmatched tumor/normal samples)
- Clinical (demographic, treatment, survival information)
- miRNA sequencing
- Protein expression
- mRNA sequencing
- DNA methylation
- Copy number variants
- Somatic mutations
- Biospecimen data
- Diagnostic / tissue / radiological images
- Whole exome / genome sequencing
- Total RNA sequencing
- Array-based expression

TCGA breast cancer data
For illustration, we make use of tumor data from 104 patients with breast invasive carcinoma.
Clinical information: cancer subtype (Basal, Luminal A, Luminal B, HER2-enriched), estrogen / progesterone status, survival time, pathologic stage, race, age, ...
- Subtype: Basal-like 22, HER2-enriched 18, Luminal A 44, Luminal B 20
- ER status: Negative 28, Positive 76
- PR status: Negative 37, Positive 67

TCGA breast cancer data: molecular data available for these 104 tumors
- miRNA-seq (Illumina Hi-Seq): 725 miRNAs
- Normalized protein expression (reverse phase protein arrays): 156 proteins
- RNA-seq (Illumina Hi-Seq): 19738 genes
- Methylation (Infinium HumanMethylation27 BeadChip): 21123 genes
- Somatic mutations: 4398 genes
- Copy number alterations: 21670 genes

2. Descriptive integration with multiple factor analysis

Multi-table analyses
Individuals are described by a set of (possibly related) variables that are structured into several groups.
Several potential goals:
- Identify relationships between tables (inter-structure): canonical correlation
- Identify a consensus (common structure) among tables: multiple factor analysis (Escofier and Pagès, 1997)
These approaches borrow from multivariate methods developed for ecological, survey, and chemometrics data.

Multiple factor analysis (MFA)
We seek common structures present in some or all of the data tables:
- Simultaneously deal with tables containing information on the same individuals...
- ... but first, groups of variables must be made comparable!
  - Balanced weighting of the different groups of variables
  - Differing numbers of variables in each group
  - Type of variables (quantitative, categorical) may differ between groups

Multiple factor analysis: four major steps
1. Perform principal components analysis (PCA) on each dataset individually.
2. Normalize each dataset by dividing its elements by the square root of the first eigenvalue obtained from step 1.
3. Merge the normalized data and perform a global PCA on the merged data.
4. Project the individual datasets onto the global analysis to analyze commonalities and discrepancies.
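
The four steps can be sketched in a few lines of standard-library Python. This is only a minimal illustration, not the ade4 or FactoMineR implementation: it assumes quantitative tables stored as lists of rows (samples by variables), uses power iteration to obtain just the first eigenvalue and axis of each PCA, and the function names and toy tables (`expression`, `methylation`) are hypothetical.

```python
import math

def center_columns(table):
    """Center each variable (column) of a samples-by-variables table."""
    n = len(table)
    means = [sum(col) / n for col in zip(*table)]
    return [[x - m for x, m in zip(row, means)] for row in table]

def leading_axis(table, iterations=500):
    """Leading eigenvalue and eigenvector of X^T X / n, by power iteration."""
    n, p = len(table), len(table[0])
    axis = [1.0 / math.sqrt(p)] * p  # simple starting vector (assumption)
    eigenvalue = 0.0
    for _ in range(iterations):
        scores = [sum(x * a for x, a in zip(row, axis)) for row in table]
        image = [sum(table[i][j] * scores[i] for i in range(n)) / n
                 for j in range(p)]
        eigenvalue = math.sqrt(sum(v * v for v in image))
        axis = [v / eigenvalue for v in image]
    return eigenvalue, axis

def mfa_first_axis(tables):
    """First MFA axis for several tables describing the same samples."""
    # Step 1: separate PCA of each centered table (first eigenvalue only).
    centered = [center_columns(t) for t in tables]
    # Step 2: divide each table by the square root of its first eigenvalue.
    weighted = []
    for t in centered:
        first_eigenvalue, _ = leading_axis(t)
        scale = math.sqrt(first_eigenvalue)
        weighted.append([[x / scale for x in row] for row in t])
    # Step 3: merge the weighted tables column-wise and run a global PCA.
    merged = [sum((t[i] for t in weighted), []) for i in range(len(tables[0]))]
    _, global_axis = leading_axis(merged)
    global_scores = [sum(x * a for x, a in zip(row, global_axis))
                     for row in merged]
    # Step 4: project each table onto its own block of the global axis.
    partial_scores, start = [], 0
    for t in weighted:
        block = global_axis[start:start + len(t[0])]
        partial_scores.append([sum(x * a for x, a in zip(row, block))
                               for row in t])
        start += len(t[0])
    return weighted, global_scores, partial_scores

# Toy tables (made-up values): 5 samples, 3 and 2 variables.
expression = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 2.5],
              [4.0, 3.0, 3.0], [5.0, 6.0, 4.5]]
methylation = [[10.0, 0.0], [12.0, 1.0], [9.0, 3.0], [15.0, 2.0], [11.0, 5.0]]
weighted, global_scores, partial_scores = mfa_first_axis([expression, methylation])
```

After step 2 every table has a first eigenvalue of 1, so no single table can dominate the global PCA simply because it has more variables or a larger scale. In this sketch the global score of a sample is the sum of its partial scores over the tables.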

Multiple factor analysis: superposed graphical representation of partial PCAs (figure source: http://factominer.free/docs/afm.pdf)

Multiple factor analysis for TCGA data via ade4
Measure of proximity between each data table and the consensus = projected inertia from each table on the first two MFA axes.
Notes:
- All MFA graphics courtesy of Denis Laloë.
- Pre-processing: $\log_2(x + 1)$ for RNA-seq and miRNA-seq, arcsine transformation for methylation.
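
The pre-processing transformations mentioned above can be sketched as follows. The slide does not preserve the argument of the arcsine, so this example assumes the commonly used arcsine-square-root transform of methylation proportions in [0, 1]; that choice and the function names are assumptions, not something stated in the talk.

```python
import math

def log2_plus_one(counts):
    """log2(x + 1) transform for RNA-seq or miRNA-seq counts."""
    return [math.log2(c + 1) for c in counts]

def arcsine_sqrt(proportions):
    """arcsin(sqrt(x)) for proportions in [0, 1] (assumed form of the arcsine step)."""
    return [math.asin(math.sqrt(p)) for p in proportions]

# Toy values (made up for illustration).
transformed_counts = log2_plus_one([0, 1, 3, 7])
transformed_methylation = arcsine_sqrt([0.0, 0.25, 1.0])
```

Adding 1 before taking the logarithm keeps zero counts finite, and both transforms compress the range of the raw values before the tables are compared.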

Multiple factor analysis for TCGA data: similarity between MFA and individual PCA results

Multiple factor analysis for TCGA data: projection of data tables onto consensus

3. Clustering integration with iCluster+

Integrative clustering
- Goal: discover new phenotype subgroups (e.g., cancer subtypes) and their molecular drivers in a comprehensive genetic context
- Jointly model discrete and continuous variables arising from genomic/epigenomic/transcriptomic profiling
- Hypothesis: diverse molecular phenotypes can be predicted by a set of orthogonal latent (i.e., not directly observable) variables representing distinct molecular drivers

iCluster+ integrative clustering
- Integrates binary (mutation), categorical (copy number gain/normal/loss), continuous or count (gene expression) data
- Generalized linear regression for the joint model, with a common set of latent variables, plus penalization via lasso terms:
  $f(X_t) = \beta_t Z + E_t$
  where $X_t$ is the $p_t \times n$ data matrix for data type $t$, $\beta_t$ the loading matrix, $Z$ the shared $K \times n$ latent variables, and $E_t$ the uncorrelated Gaussian error terms
- Assume $Z_i \sim N(0, I_K)$
- Sparse model obtained via data-specific lasso penalties $\lambda_t$
Note: the latent-variable model is similar to PCA but better suited to heteroscedastic data.
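
To make the roles of the shared latent variables and the data-specific loadings concrete, here is a small simulation sketch of the continuous case, where $f$ is taken to be the identity. It is not iClusterPlus code; the function name, dimensions, loadings, and noise level are all made up for illustration.

```python
import random

def simulate_data_type(loadings, latent, noise_sd, rng):
    """Return a p_t x n matrix X_t = beta_t Z + E_t with Gaussian errors.

    loadings: p_t x K matrix (beta_t); latent: K x n matrix (Z).
    """
    n_latent = len(latent)
    n_samples = len(latent[0])
    return [[sum(row[k] * latent[k][i] for k in range(n_latent))
             + rng.gauss(0.0, noise_sd)
             for i in range(n_samples)]
            for row in loadings]

rng = random.Random(0)
K, n = 2, 5
# Shared latent variables: each entry drawn from N(0, 1), so Z_i ~ N(0, I_K).
Z = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(K)]

# Hypothetical loading matrices for two data types (p_t x K).
beta_expression = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
beta_protein = [[1.0, 1.0], [0.0, 0.0]]

X_expression = simulate_data_type(beta_expression, Z, 0.1, rng)
X_protein = simulate_data_type(beta_protein, Z, 0.1, rng)
```

Both simulated tables depend on the same `Z`, which is what lets a joint model recover structure common to all data types. A row of zeros in a loading matrix, as in `beta_protein`, corresponds to a feature that carries no information about the latent variables.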

A word on sparse methods
- High-dimensional data often contain many variables that are irrelevant for predicting a response or assigning observations to a group.
- Including these irrelevant variables in a predictive model leads to a loss in predictive performance.
- Sparse methods add an appropriate penalty term to the objective function of the method to suppress these irrelevant variables.
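
A simple way to see how a lasso penalty suppresses variables is the soft-thresholding operator, which solves the one-dimensional problem of minimizing $\frac{1}{2}(y - b)^2 + \lambda |b|$ over $b$. The sketch below illustrates that textbook result only; it does not describe how iClusterPlus performs its estimation, and the example coefficients are made up.

```python
def soft_threshold(value, penalty):
    """Minimizer of 0.5 * (value - b)**2 + penalty * abs(b) over b."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

# Made-up unpenalized coefficients: small ones are set exactly to zero,
# large ones are shrunk toward zero by the penalty.
raw_coefficients = [0.50, -0.02, 0.01, -0.40]
sparse_coefficients = [soft_threshold(c, 0.03) for c in raw_coefficients]
```

Coefficients whose magnitude falls below the penalty become exactly zero, which is why the penalty acts as a variable selector rather than only a shrinkage device.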

iCluster+ integrative clustering
Let $x_{ijt}$ be the $j$th genomic feature in sample $i$ of data type $t$.
- If $x_{ijt}$ is binary (e.g., mutation status):
  $\log \frac{P(x_{ijt} = 1 \mid Z_i)}{1 - P(x_{ijt} = 1 \mid Z_i)} = \alpha_{jt} + \beta_{jt} Z_i$
- If $x_{ijt}$ is categorical (e.g., copy number status: loss/normal/gain):
  $P(x_{ijt} = c \mid Z_i) = \frac{\exp(\alpha_{jct} + \beta_{jct} Z_i)}{\sum_{\ell=1}^{C} \exp(\alpha_{j\ell t} + \beta_{j\ell t} Z_i)}, \quad c = 1, \ldots, C$
- If $x_{ijt}$ is continuous (e.g., expression):
  $x_{ijt} = \alpha_{jt} + \beta_{jt} Z_i + \varepsilon_{ijt}, \quad \varepsilon_{ijt} \sim N(0, \sigma^2_{jt})$
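
The three conditional models share the same linear predictor $\alpha + \beta Z_i$ and differ only in how it is mapped to the observed feature. A minimal sketch for a single feature, with hypothetical function names and made-up parameter values:

```python
import math

def linear_predictor(alpha, beta, z):
    """alpha + beta . z for one feature, with z the sample's latent vector."""
    return alpha + sum(b * zk for b, zk in zip(beta, z))

def prob_binary(alpha, beta, z):
    """P(x = 1 | z) under the logit model (e.g., mutation status)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(alpha, beta, z)))

def prob_categorical(alphas, betas, z):
    """P(x = c | z) for c = 1..C under the multinomial logit model."""
    scores = [linear_predictor(a, b, z) for a, b in zip(alphas, betas)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def mean_continuous(alpha, beta, z):
    """E[x | z] under the Gaussian model (e.g., expression)."""
    return linear_predictor(alpha, beta, z)

# Made-up latent vector for one sample with K = 2 latent variables.
z_i = [0.2, 0.4]
p_mutated = prob_binary(0.3, [1.0, -0.5], z_i)
p_copy_number = prob_categorical([0.0, 0.5, -0.5],
                                 [[1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]], z_i)
expected_expression = mean_continuous(2.0, [0.5, 0.5], z_i)
```

Because every data type is driven by the same latent vector, a change in `z_i` shifts the mutation probability, the copy number probabilities, and the expected expression together.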

iClusterPlus Bioconductor package
- Estimation via a modified Monte Carlo Newton-Raphson algorithm
- Optimization of the number of latent variables $K$ (deviance ratio) and of the lasso penalty terms $\lambda_t$ (BIC) is needed...
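
As a reminder of what BIC-based tuning involves, the sketch below uses the generic definition BIC = -2 log-likelihood + (number of parameters) x log(number of samples) and picks the penalty with the smallest value. The candidate fits are made-up numbers, the function names are hypothetical, and the exact criterion computed inside iClusterPlus may differ in its details.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Generic Bayesian information criterion (smaller is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

def select_by_bic(candidates, n_samples):
    """Pick the penalty whose fit has the smallest BIC.

    candidates maps a penalty value to (log-likelihood, nonzero parameters).
    """
    return min(candidates, key=lambda lam: bic(*candidates[lam], n_samples))

# Hypothetical fits: a larger penalty gives fewer nonzero parameters
# but a lower log-likelihood.
candidate_fits = {
    0.01: (-100.0, 50),
    0.03: (-105.0, 20),
    0.10: (-150.0, 5),
}
best_penalty = select_by_bic(candidate_fits, n_samples=104)
```

The criterion trades goodness of fit against model size, so the selected penalty is the one where dropping further variables would cost more in likelihood than it saves in complexity.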

Preparing data for integrative clustering
- Somatic mutation data: keep genes that have mutations in at least 2% of the samples
- RNA-seq data: keep the 1000 most variable genes (i.e., those with the largest coefficient of variation), and center the data for each individual
- CNA data: keep the 1000 most variable genes (i.e., those with the largest coefficient of variation), and set all values between -0.25 and 0.25 equal to 0
- Protein data: keep all values
- Set $K = 4$ latent variables (equal to the number of cancer subtypes) and use the default values of the lasso penalty parameters ($\lambda_t = 0.03$ for all $t$)
Note: for now, only 4 datasets may be integrated in iCluster+.
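
The filtering rules above can be sketched with plain lists, using rows for samples and columns for genes. Several details are assumptions made only for this illustration: variability is measured as standard deviation divided by the absolute mean, "center data for each individual" is read as subtracting each sample's own mean, the interval (-0.25, 0.25) is treated as open, and all function names are hypothetical.

```python
import statistics

def filter_mutations(matrix, min_rate=0.02):
    """Keep genes (columns) mutated in at least min_rate of the samples (rows)."""
    n = len(matrix)
    keep = [j for j in range(len(matrix[0]))
            if sum(row[j] for row in matrix) / n >= min_rate]
    return [[row[j] for j in keep] for row in matrix], keep

def top_variable(matrix, k=1000):
    """Keep the k columns with the largest coefficient of variation."""
    def coefficient_of_variation(j):
        column = [row[j] for row in matrix]
        sd = statistics.pstdev(column)
        mean = statistics.fmean(column)
        if sd == 0:
            return 0.0  # constant columns are never "most variable"
        return sd / abs(mean) if mean != 0 else float("inf")
    ranked = sorted(range(len(matrix[0])),
                    key=coefficient_of_variation, reverse=True)
    keep = sorted(ranked[:k])
    return [[row[j] for j in keep] for row in matrix], keep

def center_samples(matrix):
    """Subtract each sample's (row's) mean from its values."""
    centered = []
    for row in matrix:
        mean = statistics.fmean(row)
        centered.append([x - mean for x in row])
    return centered

def zero_small_values(matrix, bound=0.25):
    """Set copy number values strictly between -bound and bound to 0."""
    return [[0.0 if -bound < x < bound else x for x in row] for row in matrix]

# Toy inputs (made up): 4 samples by 3 genes, with a high cutoff for illustration.
mutations = [[1, 0, 0], [0, 0, 0], [1, 0, 1], [0, 0, 0]]
kept_mutations, kept_columns = filter_mutations(mutations, min_rate=0.5)
```

With real data the default `min_rate=0.02` and `k=1000` would correspond to the settings listed above; the toy call uses a higher cutoff only so that the effect is visible on four samples.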

iCluster+ results (K = 4 latent variables)
Top features based on lasso-penalized coefficients for each data type:
  $mutation: "CDH1" "GATA3" "PCDH15" "PIK3CA" "RYR1" "TP53" ...
  $protein: "4E-BP1-R-V" "Akt_pS473-R-V" "AR-R-V" "Bcl-2-M-V" "Bim-R-V" "c-kit-r-v" ...
  $rna: "A2ML1 144568" "ABCA8 10351" "ABCC8 6833" "ADCY1 107" "ADH1B 125" "ADIPOQ 9370" ...
  $tumor: "ASIC1" "ACVR1B" "ACVRL1" "APOF" "AQP2" "AQP5" ...
Number of selected features per data type:
  > lapply(sigfeatures, length)
  $mutation [1] 46
  $protein [1] 39
  $rna [1] 250
  $tumor [1] 247
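
The slides do not state how the top features were chosen from the penalized coefficients. One plausible rule, assumed here purely for illustration, is to score each feature by the sum of the absolute values of its loadings over the latent variables and keep the features above the 75th percentile of that score; the reported counts (for example, 39 of 156 proteins) are at least consistent with a top-quartile cut. The function name and the loading values below are made up.

```python
import statistics

def top_quartile_features(names, loadings):
    """Features whose summed absolute loading exceeds the 75th percentile."""
    scores = [sum(abs(b) for b in row) for row in loadings]
    cutoff = statistics.quantiles(scores, n=4, method="inclusive")[2]
    return [name for name, score in zip(names, scores) if score > cutoff]

# Made-up loadings for four genes and K = 2 latent variables.
gene_names = ["CDH1", "GATA3", "PIK3CA", "TP53"]
gene_loadings = [[0.0, 0.0], [0.5, -0.5], [1.0, 1.0], [-2.0, 1.0]]
selected = top_quartile_features(gene_names, gene_loadings)
```

Features whose loadings were shrunk to zero by the lasso penalty get a score of 0 and can never be selected, so the ranking only distinguishes among features that the sparse model retained.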

4. Discussion

Two major integrative strategies
Description:
- Variable symmetry
- No matrix inversions
- Multi-table analysis through MFA (supervised analysis possible between groups)
Explanation / prediction:
- Asymmetry of variables: one group explains another group
- Matrix inversion
- Collinearity
- n < p and matrix ranks
- Regularization procedures needed
- Clustering via iCluster+, supervised (discriminant) analysis via predictive methods like IPF-Lasso

Integrative predictive/explanatory methods like iClusterPlus seem very promising for integrative omics analysis...
... but data preprocessing and model tuning are often needed (and not straightforward to perform):
- Choice of the number of latent variables K
- Choice of the lasso penalty terms
- Influence of pre-processing steps on results
- ...

Integrative approaches can (should?) account for the intrinsic structures of biological relationships from different high-throughput platforms.

R/Bioconductor packages
- ade4 (http://pbil.univ-lyon1.fr/ade4): multiple factor analysis, multiple co-inertia analysis, STATIS
- FactoMineR (http://factominer.free.fr): multiple factor analysis
- mixOmics (http://mixomics.org): correlation analysis, partial least squares
- iClusterPlus
Thank you!

Some references...
- Meng, C. et al. (2014). A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics, 15:162.
- de Tayrac, M. et al. (2009). Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: Multiple Factor Analysis approach. BMC Genomics, 10(1), 32.
- Culhane, A. C. et al. (2005). MADE4: an R package for multivariate analysis of gene expression data. Bioinformatics, 21(11), 2789-2790.
- Dray, S. and Dufour, A.-B. (2007). The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4).
- Escofier, B. and Pagès, J. (1998). Analyses factorielles simples et multiples. Dunod.
- Lebart, L., Piron, M., Morineau, A. (2006). Statistique exploratoire multidimensionnelle. Dunod.
- Lê Cao, K.-A. et al. (2008). A sparse PLS for variable selection when integrating omics data. Statistical Applications in Genetics and Molecular Biology, 7(1).
- Salmi, B. et al. (2010). Multivariate analysis to compare pig meat quality traits according to breed and rearing system. Proceedings of the 9th WCGALP, Leipzig, August 1-6, 2010, 442.
- Tenenhaus, A. and Tenenhaus, M. (2014). Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. European Journal of Operational Research, 238(2), 391-403.