Integration of heterogeneous omics data

Andrea Rau (andrea.rau@jouy.inra.fr)
March 11, 2016
Formation doctorale: Biologie expérimentale animale et modélisation prédictive

Outline
1. Introduction
   - Integromics
   - Example data: TCGA multi-omics data
2. Descriptive integration with multiple factor analysis
3. Clustering integration with iCluster+
4. Discussion

Integrative data analysis

Integrative omics data analysis ("integromics")
- Public genome databases like NCBI already house petabytes (10^6 GB) of data, and are growing exponentially each year
- It is increasingly difficult to extract full value from massive omics data in a unified and meaningful way:
  - Gene expression (RNA-seq, microarrays)
  - Protein expression
  - Methylation
  - Metabolome
  - Copy number variants
  - Genomic mutations
  - Functional annotations
  - Gene pathway membership
  - Protein-protein interactions
  - High-throughput phenotypic information
- Focusing on a single platform runs the risk of missing an obvious signal!

A relatively new phenomenon

The broad umbrella of integrative data analysis
- Ultimate goal: understanding complex processes
- Lots of different meanings:
  - Exploration
  - Description
  - Classification (supervised, unsupervised, semi-supervised)
  - Variable selection / biomarker identification
  - Phenotype prediction
  - Meta-analysis
  - ...

Integrative multi-omics analysis: What? Why?
1. Exploration
   - Multiple factor analysis (MFA)
   - Regularized canonical correlation
   - Multiple co-inertia analysis
2. Classification
   - Clustering (iClusterPlus)
3. Prediction
   - Integrative lasso with Penalty Factors (IPF-Lasso)
   - Multi-group partial least squares
   - Penalized linear discriminant analysis

... with lots of statistical and practical difficulties!
- Missing or incomplete data
- Potentially heterogeneous quality across datasets
- Need for normalization / standardization / preprocessing (???)
- Many (!!) more variables than observations (ultra-high dimensionality)
- Multiple testing
- Datasets of differing sizes
- Potentially large requirements for data storage and computing power
- ... and of course, biological interpretation!

Introduction to the TCGA data
- Comprehensive and coordinated effort to improve the molecular understanding of major types and sub-types of cancer through high-throughput genomics
- Clinical information + genomic characterization data + high-level sequence analysis of tumor genomes
- 34 cancer types/sub-types
- Open-access tier (public data not unique to individuals) and controlled-access tier (primary sequence data, raw SNP data)

TCGA data (matched/unmatched tumor/normal samples)
- Clinical (demographic, treatment, survival information)
- miRNA sequencing
- Protein expression
- mRNA sequencing
- DNA methylation
- Copy number variants
- Somatic mutations
- Biospecimen data
- Diagnostic / tissue / radiological images
- Whole exome / genome sequencing
- Total RNA sequencing
- Array-based expression

TCGA breast cancer data
For illustration, we make use of tumor data from 104 patients with breast invasive carcinoma.
Clinical information: cancer subtype (Basal, Luminal A, Luminal B, HER2-enriched), estrogen / progesterone status, survival time, pathologic stage, race, age, ...

Subtype      Basal-like   HER2-enriched   Luminal A   Luminal B
Patients     22           18              44          20

ER status    Negative   Positive
Patients     28         76

PR status    Negative   Positive
Patients     37         67

TCGA breast cancer data
For illustration, we make use of tumor data from 104 patients with breast invasive carcinoma:
- miRNA-seq (Illumina Hi-Seq): 725 miRNAs
- Normalized protein expression (reverse phase protein arrays): 156 proteins
- RNA-seq (Illumina Hi-Seq): 19738 genes
- Methylation (Infinium HumanMethylation27 BeadChip): 21123 genes
- Somatic mutations: 4398 genes
- Copy number alterations: 21670 genes

Descriptive integration with multiple factor analysis

Multi-table analyses
Individuals are described by a set of (possibly related) variables that are structured into several groups.
Several potential goals:
- Identify relationships between tables (inter-structure): canonical correlation
- Identify a consensus (common structure) among tables: multiple factor analysis (Escofier and Pagès, 1997)
These approaches borrow from multivariate methods developed for ecological/survey/chemometrics data.

Multiple factor analysis (MFA)
We seek common structures present in some or all of the data tables:
- Simultaneously deal with tables containing information on the same individuals...
- ... but first, groups of variables must be made comparable!
  - Balanced weighting of different groups of variables
  - Differing numbers of variables in each group
  - Type of variables (quantitative, categorical) may differ between groups

Multiple factor analysis
Four major steps:
1. Perform principal components analysis (PCA) on each dataset individually
2. Normalize each dataset by dividing its elements by the square root of the first eigenvalue obtained from step 1
3. Merge normalized data, and perform a global PCA on the merged data
4. Project individual datasets onto the global analysis to analyze commonalities and discrepancies
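The four steps can be sketched in a few lines of NumPy. This is a minimal illustration and not the ade4 or FactoMineR implementation. The function name `mfa` is hypothetical. The sketch assumes quantitative tables that share the same rows (individuals), and it only centers columns (no scaling to unit variance). PCA is done through the singular value decomposition: the largest singular value of a centered table is the square root of its first eigenvalue, up to a factor 1/n that is the same for every table and is omitted here.

```python
import numpy as np

def mfa(tables, n_components=2):
    """Sketch of MFA on a list of (n_samples x p_t) arrays with matching rows."""
    weighted = []
    for X in tables:
        Xc = X - X.mean(axis=0)                      # column-center the table
        # Step 1: separate PCA; keep only the largest singular value
        s1 = np.linalg.svd(Xc, compute_uv=False)[0]
        # Step 2: divide by sqrt(first eigenvalue) so every table has top eigenvalue 1
        weighted.append(Xc / s1)

    # Step 3: global PCA on the column-wise concatenation of the weighted tables
    merged = np.hstack(weighted)
    U, s, Vt = np.linalg.svd(merged, full_matrices=False)
    global_scores = U[:, :n_components] * s[:n_components]

    # Step 4: partial scores, i.e. each table projected onto the global axes.
    # The factor T makes the average of the partial scores equal the global scores.
    T = len(tables)
    partial, start = [], 0
    for Xw in weighted:
        stop = start + Xw.shape[1]
        partial.append(T * Xw @ Vt[:n_components, start:stop].T)
        start = stop
    return global_scores, partial, s ** 2

# Toy usage: two tables of very different size and scale on the same 20 individuals
rng = np.random.default_rng(0)
table_a = rng.normal(size=(20, 50))
table_b = 100 * rng.normal(size=(20, 5))
scores, partial_scores, eigenvalues = mfa([table_a, table_b])
```

Because each weighted table has a largest eigenvalue of 1, the first global eigenvalue lies between 1 and the number of tables. A value close to the number of tables suggests that the first axis is shared by all of them.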

Multiple factor analysis
Superposed graphical representation of partial PCAs (source: http://factominer.free/docs/afm.pdf)

Multiple factor analysis for TCGA data via ade4
- Measure of proximity between each data table and the consensus = projected inertia from each table on the first two MFA axes
- All MFA graphics courtesy of Denis Laloë
- Pre-processing: log2(· + 1) for RNA-seq and miRNA-seq, arcsin(·) for methylation

Multiple factor analysis for TCGA data
Similarity between MFA and individual PCA results

Multiple factor analysis for TCGA data
Projection of data tables onto consensus

Clustering integration with iCluster+

Integrative clustering
- Goal: discover new phenotype subgroups (e.g., cancer subtypes) and their molecular drivers in a comprehensive genetic context
- Jointly model discrete and continuous variables arising from genomic/epigenomic/transcriptomic profiling
- Hypothesis: diverse molecular phenotypes can be predicted by a set of orthogonal latent variables (latent = not directly observable) representing distinct molecular drivers

iCluster+ integrative clustering
- Integrates binary (mutation), categorical (copy number gain/normal/loss), continuous or count (gene expression) data
- Generalized linear regression for the joint model, with a common set of latent variables + penalization via lasso terms:
  $f(X_t) = \beta_t Z + E_t$
  where $X_t$ is the $p_t \times n$ data matrix for data type $t$, $\beta_t$ the loading matrix, $Z$ the shared $K \times n$ latent variables, and $E_t$ the uncorrelated Gaussian error terms
- Assume $Z_i \sim N(0, I_K)$
- Sparse model obtained via data-specific lasso penalties $\lambda_t$
- NOTE: the latent variable model is similar to PCA but better suited to heteroscedastic data

A word on sparse methods
- High-dimensional data often contain many irrelevant variables for predicting a response / assigning observations to a group
- Including these irrelevant variables in a predictive model leads to a loss in predictive performance
- Sparse methods add an appropriate penalty term to the objective function of the method to suppress these irrelevant variables
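As a small illustration of how a lasso (L1) penalty suppresses variables, consider one coefficient at a time. This assumes the textbook case of an orthonormal design, where the lasso solution is the unpenalized estimate passed through the soft-thresholding operator. The function name `soft_threshold` and the toy coefficients are illustrative only, and this is not the estimation procedure used by iCluster+.

```python
def soft_threshold(b, lam):
    """Lasso shrinkage of one coefficient: sign(b) * max(|b| - lam, 0)."""
    if b > lam:
        return b - lam      # large positive coefficient: shrunk toward 0
    if b < -lam:
        return b + lam      # large negative coefficient: shrunk toward 0
    return 0.0              # small coefficient: set exactly to 0 (variable dropped)

# Toy unpenalized coefficients: two strong effects and three weak ones
coefs = [2.5, -0.02, 0.4, 0.01, -1.2]
sparse = [soft_threshold(b, 0.5) for b in coefs]
n_selected = sum(1 for b in sparse if b != 0.0)
```

With a penalty of 0.5, only the two strong coefficients survive. A larger penalty yields a sparser model, which is why the penalty level has to be tuned.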

iCluster+ integrative clustering
Let $x_{ijt}$ be the $j$th genomic feature in sample $i$ of data type $t$.
- If $x_{ijt}$ is binary (e.g., mutation status):
  $\log \frac{P(x_{ijt} = 1 \mid Z_i)}{1 - P(x_{ijt} = 1 \mid Z_i)} = \alpha_{jt} + \beta_{jt} Z_i$
- If $x_{ijt}$ is categorical (e.g., copy number status: loss/normal/gain):
  $P(x_{ijt} = c \mid Z_i) = \frac{\exp(\alpha_{jct} + \beta_{jct} Z_i)}{\sum_{l=1}^{C} \exp(\alpha_{jlt} + \beta_{jlt} Z_i)}, \quad c = 1, \ldots, C$
- If $x_{ijt}$ is continuous (e.g., expression):
  $x_{ijt} = \alpha_{jt} + \beta_{jt} Z_i + \varepsilon_{ijt}, \quad \varepsilon_{ijt} \sim N(0, \sigma^2_{jt})$
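To make the three link functions concrete, the sketch below evaluates each model for a single feature given one sample's latent vector. It uses only the Python standard library. All function names and parameter values are hypothetical, and the code illustrates the formulas only (no estimation and no penalty).

```python
import math

def dot(u, v):
    """Inner product beta . Z_i for plain Python lists."""
    return sum(a * b for a, b in zip(u, v))

def p_binary(alpha, beta, z):
    """P(x = 1 | z) under the logit model: inverse of log(p / (1 - p))."""
    return 1.0 / (1.0 + math.exp(-(alpha + dot(beta, z))))

def p_categorical(alphas, betas, z):
    """[P(x = c | z) for c = 1..C] under the multinomial logit (softmax) model."""
    scores = [a + dot(b, z) for a, b in zip(alphas, betas)]
    m = max(scores)                          # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def mean_continuous(alpha, beta, z):
    """E[x | z] under the Gaussian model (the noise term has mean 0)."""
    return alpha + dot(beta, z)

# One sample with K = 2 latent variables (toy values)
z_i = [0.5, -1.0]
p_mut = p_binary(-1.0, [2.0, 0.0], z_i)                                     # mutation
p_cn = p_categorical([0.0, 1.0, 0.0], [[1, 0], [0, 0], [-1, 0]], z_i)       # loss/normal/gain
mu_expr = mean_continuous(3.0, [1.0, 1.0], z_i)                             # expression
```

The same latent vector `z_i` enters all three models, which is how iCluster+ ties the data types together.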

iClusterPlus Bioconductor package
- Estimation via a modified Monte Carlo Newton-Raphson algorithm
- Optimization of the number of latent variables $K$ (deviance ratio) and of the lasso penalty terms $\lambda_t$ (BIC) needed...

Preparing data for integrative clustering
(For now, only 4 datasets may be integrated in iCluster+.)
- Somatic mutation data: keep genes that have mutations in at least 2% of the samples
- RNA-seq data: keep the 1000 most variable genes (i.e., those with the largest coefficient of variance), and center data for each individual
- CNA data: keep the 1000 most variable genes (i.e., those with the largest coefficient of variance), set all values between -0.25 and 0.25 equal to 0
- Protein data: keep all values
Set $K = 4$ latent variables (equal to the number of cancer subtypes), and use default values of the lasso penalty parameters ($\lambda_t = 0.03$ for all $t$).
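The filtering rules above could be sketched as follows, using plain Python dictionaries that map a gene name to its per-sample values. The helper names are hypothetical. Two points are assumptions rather than facts from the slides: variability is ranked here by plain variance, and the CNA cutoff is treated as an open interval. Per-individual centering of the RNA-seq data is not shown.

```python
from statistics import pvariance

def keep_mutated(mut, min_frac=0.02):
    """Keep genes mutated (value 1) in at least min_frac of the samples."""
    return {g: v for g, v in mut.items() if sum(v) / len(v) >= min_frac}

def top_variable(data, k=1000):
    """Keep the k genes with the largest variance (assumed variability measure)."""
    ranked = sorted(data, key=lambda g: pvariance(data[g]), reverse=True)
    return {g: data[g] for g in ranked[:k]}

def zero_small_cna(values, cutoff=0.25):
    """Set copy number values strictly between -cutoff and cutoff to 0."""
    return [0.0 if -cutoff < v < cutoff else v for v in values]

# Toy usage on 100 samples
mutations = {"G1": [1] + [0] * 99, "G2": [1] * 3 + [0] * 97, "G3": [0] * 100}
kept = keep_mutated(mutations)                      # only G2 passes the 2% rule
expr = {"a": [1, 1, 1], "b": [0, 5, 10], "c": [1, 2, 3]}
most_variable = top_variable(expr, k=2)             # keeps "b" and "c"
cna = zero_small_cna([-0.3, -0.1, 0.0, 0.2, 0.5])
```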

iCluster+ results (K = 4 latent variables)
Top features based on lasso-penalized coefficients for each data type:

    $mutation
    "CDH1" "GATA3" "PCDH15" "PIK3CA" "RYR1" "TP53" ...
    $protein
    "4E-BP1-R-V" "Akt_pS473-R-V" "AR-R-V" "Bcl-2-M-V" "Bim-R-V" "c-kit-r-v" ...
    $rna
    "A2ML1 144568" "ABCA8 10351" "ABCC8 6833" "ADCY1 107" "ADH1B 125" "ADIPOQ 9370" ...
    $tumor
    "ASIC1" "ACVR1B" "ACVRL1" "APOF" "AQP2" "AQP5" ...

    > lapply(sigfeatures, length)
    $mutation
    [1] 46
    $protein
    [1] 39
    $rna
    [1] 250
    $tumor
    [1] 247

Discussion

Two major integrative strategies
Description:
- Variable symmetry
- No matrix inversions
- Multi-table analysis through MFA (supervised analysis possible between groups)
Explanation / Prediction:
- Asymmetry of variables: one group explains another group
- Matrix inversion
  - Collinearity
  - n < p and matrix ranks
- Regularization procedures needed
- Clustering via iCluster+, supervised (discriminant) analysis via predictive methods like IPF-Lasso

Discussion
- Integrative predictive/explanatory methods like iClusterPlus seem very promising for integrative omics analysis...
- ... but data preprocessing/model tuning is often needed (and not straightforward to perform)
  - Choice of number of latent variables K
  - Choice of lasso penalty terms
  - Influence of pre-processing steps on results
  - ...

- Integrative approaches can (should?) account for the intrinsic structures of biological relationships from different high-throughput platforms

R/Bioconductor packages
- ade4 (http://pbil.univ-lyon1.fr/ade4): multiple factor analysis, multiple co-inertia analysis, STATIS
- FactoMineR (http://factominer.free.fr): multiple factor analysis
- mixOmics (http://mixomics.org): correlation analysis, partial least squares
- iClusterPlus

Thank you!

Some references...
- Meng, C. et al. (2014). A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics, 15:162.
- de Tayrac, M. et al. (2009). Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: Multiple Factor Analysis approach. BMC Genomics, 10(1), 32.
- Culhane, A. C. et al. (2005). MADE4: an R package for multivariate analysis of gene expression data. Bioinformatics, 21(11), 2789-2790.
- Dray, S. and Dufour, A.-B. (2007). The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4).
- Escofier, B. and Pagès, J. (1998). Analyses factorielles simples et multiples. Dunod.
- Lebart, L., Piron, M., Morineau, A. (2006). Statistique exploratoire multidimensionnelle. Dunod.
- Lê Cao, K. A. et al. (2008). A sparse PLS for variable selection when integrating omics data. Statistical Applications in Genetics and Molecular Biology, 7(1).
- Salmi, B. et al. (2010). Multivariate analysis to compare pig meat quality traits according to breed and rearing system. Proceedings of the 9th WCGALP, Leipzig, August 1-6, 2010, 442.
- Tenenhaus, A. and Tenenhaus, M. (2014). Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. European Journal of Operational Research, 238(2), 391-403.