Including prior knowledge in shrinkage classifiers for genomic data

1 Including prior knowledge in shrinkage classifiers for genomic data
Jean-Philippe Vert, Mines ParisTech / Curie Institute / Inserm
Statistical Genomics in Biomedical Research, BIRS workshop, Banff, Canada, July 18-23, 2010

2 Cancer diagnosis

3 Cancer prognosis

4 Pattern recognition, aka supervised classification

8 Pattern recognition, aka supervised classification
Challenges: high dimension, few samples, structured data, heterogeneous data, prior knowledge, fast and scalable implementations, interpretable models.

9 Shrinkage estimators
The problem
Focus on a large family of classifiers, e.g., linear predictors f_β(x) = β^T x.
For any candidate β, quantify how "good" the linear function f_β is on the training set with some empirical risk, e.g.:
R(β) = (1/n) Σ_{i=1}^n ℓ(f_β(x_i), y_i).
Choose the β that achieves the minimum empirical risk, subject to some constraint:
min_β R(β) subject to Ω(β) ≤ C.
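
To make the generic recipe concrete, here is a minimal Python/numpy sketch (my own illustration, not code from the talk), assuming the logistic loss for ℓ and handling the constraint in its penalized (Lagrangian) form R(β) + λ‖β‖²; the synthetic data, λ and step size are arbitrary toy choices.

import numpy as np

def logistic_risk(beta, X, y):
    # Empirical risk R(beta) = (1/n) sum_i log(1 + exp(-y_i * beta^T x_i)), with y_i in {-1, +1}.
    margins = y * (X @ beta)
    return np.mean(np.log1p(np.exp(-margins)))

def fit_shrinkage_classifier(X, y, lam=1.0, lr=0.1, n_iter=500):
    # Minimize R(beta) + lam * ||beta||^2 by plain gradient descent
    # (the penalized form of "min R(beta) subject to Omega(beta) <= C").
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        margins = y * (X @ beta)
        grad_risk = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
        beta -= lr * (grad_risk + 2.0 * lam * beta)
    return beta

# Toy usage on synthetic data (n << p, as in genomic applications).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=50))
beta_hat = fit_shrinkage_classifier(X, y)
print("training risk:", logistic_risk(beta_hat, X, y))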

10 Which penalty to use?
min_β R(β) subject to Ω(β) ≤ C
Ω(β) constrains the solution ("increases bias and decreases variance").
Common choices are:
Ω(β) = Σ_{i=1}^p β_i^2 (ridge regression, SVM, ...)
Ω(β) = Σ_{i=1}^p |β_i| (lasso, boosting, ...)
How to select/design Ω(β) for a specific problem?
Idea: use prior knowledge to have "good guesses" in the constraint set ("do not increase bias too much").
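
As a quick aside of mine (not from the slides), here is how the two standard penalties behave on a synthetic sparse regression problem, assuming scikit-learn; the regularization strengths are arbitrary.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:10] = 1.0                      # only 10 informative features
y = X @ beta_true + 0.5 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)                   # Omega(beta) = sum_i beta_i^2
lasso = Lasso(alpha=0.1, max_iter=5000).fit(X, y)    # Omega(beta) = sum_i |beta_i|
print("non-zero ridge coefficients:", np.sum(ridge.coef_ != 0))   # typically all p
print("non-zero lasso coefficients:", np.sum(lasso.coef_ != 0))   # far fewer than p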

11 Outline
1 Cancer prognosis from DNA copy number variations
2 Diagnosis and prognosis from gene expression data
3 Conclusion

12 Outline
1 Cancer prognosis from DNA copy number variations
2 Diagnosis and prognosis from gene expression data
3 Conclusion

13 Chromosomal aberrations in cancer

14 Comparative Genomic Hybridization (CGH)
Motivation
Comparative genomic hybridization (CGH) data measure the DNA copy number along the genome.
Very useful, in particular in cancer research.
Can we classify CGH arrays for diagnosis or prognosis purposes?

15 Aggressive vs non-aggressive melanoma

16 CGH array classification
Prior knowledge
For a CGH profile x ∈ R^p, we focus on linear classifiers, i.e., on the sign of f_β(x) = β^T x.
We expect β to be:
sparse: not all positions should be discriminative;
piecewise constant: within a selected region, all probes should contribute equally.

17 Promoting sparsity with the ℓ1 penalty
The ℓ1 penalty (Tibshirani, 1996; Chen et al., 1998)
The solution of
min_{β ∈ R^p} R(β) + λ Σ_{i=1}^p |β_i|
is usually sparse.
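
A small sketch (mine, not from the slides) of the mechanism behind this sparsity, assuming a quadratic risk: the proximal operator of the ℓ1 penalty is coordinate-wise soft-thresholding, which sets small coefficients exactly to zero, and ISTA simply alternates a gradient step on R(β) with this thresholding. The toy data and λ are arbitrary.

import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each coordinate towards 0 by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=200):
    # ISTA for min_beta (1/2n) ||y - X beta||^2 + lam * sum_i |beta_i|.
    n, p = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(0)
X, beta_true = rng.normal(size=(50, 200)), np.zeros(200)
beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.normal(size=50)
print("selected coordinates:", np.flatnonzero(lasso_ista(X, y, lam=0.2)))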

18 Promoting piecewise constant profiles with the fusion penalty
The variable fusion penalty (Land and Friedman, 1996)
The solution of
min_{β ∈ R^p} R(β) + λ Σ_{i=1}^{p-1} |β_{i+1} - β_i|
is usually piecewise constant.

19 Fused lasso signal approximator (Tibshirani et al., 2005)
min_{β ∈ R^p} Σ_{i=1}^p (y_i - β_i)^2 + λ_1 Σ_{i=1}^p |β_i| + λ_2 Σ_{i=1}^{p-1} |β_{i+1} - β_i|.
The first penalty term leads to sparse solutions; the second penalty term leads to piecewise constant solutions.
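
Here is a sketch of both the fusion penalty of the previous slide and the fused lasso signal approximator (my own construction, not the authors' code). It relies on two standard facts: the total-variation part alone can be solved by projected gradient on its dual, and, per Friedman et al. (2007), the full solution is the soft-thresholding of the TV-denoised signal. The dense difference matrix and parameter values are toy choices, suitable only for small p.

import numpy as np

def tv_denoise(y, lam2, n_iter=2000):
    # Solve min_beta 0.5*||y - beta||^2 + lam2 * sum_i |beta_{i+1} - beta_i|
    # by projected gradient on the dual: beta = y - D^T u with |u_i| <= lam2.
    p = len(y)
    D = np.diff(np.eye(p), axis=0)          # (p-1) x p first-difference matrix
    u = np.zeros(p - 1)
    step = 0.25                             # safe step size since ||D D^T|| <= 4
    for _ in range(n_iter):
        u = np.clip(u - step * (D @ (D.T @ u - y)), -lam2, lam2)
    return y - D.T @ u

def fused_lasso_signal(y, lam1, lam2):
    # Soft-threshold the TV-denoised signal to add the l1 part of the penalty.
    beta_tv = tv_denoise(y, lam2)
    return np.sign(beta_tv) * np.maximum(np.abs(beta_tv) - lam1, 0.0)

# Toy CGH-like profile: piecewise constant plus noise.
rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(40), 1.5 * np.ones(30), np.zeros(30)]) + 0.3 * rng.normal(size=100)
beta_hat = fused_lasso_signal(y, lam1=0.1, lam2=2.0)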

20 Fused lasso for supervised classification (Rapaport et al., 2008)
min_{β ∈ R^p} (1/n) Σ_{i=1}^n ℓ(y_i, β^T x_i) + λ_1 Σ_{i=1}^p |β_i| + λ_2 Σ_{i=1}^{p-1} |β_{i+1} - β_i|,
where ℓ is, e.g., the hinge loss ℓ(y, t) = max(1 - yt, 0).
Implementation
When ℓ is the hinge loss (fused SVM), this is a linear program -> up to p ≈ 10^3 to 10^4.
When ℓ is convex and smooth (logistic, quadratic), efficient implementation with proximal methods -> up to p ≈ 10^8 to 10^9.
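
To make the "this is a linear program" remark concrete, here is a toy LP formulation with scipy (my sketch, not the Rapaport et al. implementation); it uses dense matrices, so it only scales to small n and p, and the lam1/lam2 values are arbitrary.

import numpy as np
from scipy.optimize import linprog

def fused_svm_lp(X, y, lam1=0.1, lam2=0.1):
    # Variables z = [beta (p), xi (n), a (p), b (p-1)] with
    # xi_i >= max(1 - y_i beta^T x_i, 0), a_j >= |beta_j|, b_j >= |beta_{j+1} - beta_j|.
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(n) / n, lam1 * np.ones(p), lam2 * np.ones(p - 1)])
    rows, rhs = [], []
    def add(beta_part, xi_part, a_part, b_part, r):
        rows.append(np.concatenate([beta_part, xi_part, a_part, b_part]))
        rhs.append(r)
    for i in range(n):                      # hinge loss: -y_i x_i^T beta - xi_i <= -1
        add(-y[i] * X[i], -np.eye(n)[i], np.zeros(p), np.zeros(p - 1), -1.0)
    for j in range(p):                      # a_j >= |beta_j|
        e = np.eye(p)[j]
        add(e, np.zeros(n), -e, np.zeros(p - 1), 0.0)
        add(-e, np.zeros(n), -e, np.zeros(p - 1), 0.0)
    for j in range(p - 1):                  # b_j >= |beta_{j+1} - beta_j|
        d = np.zeros(p)
        d[j + 1], d[j] = 1.0, -1.0
        e = np.eye(p - 1)[j]
        add(d, np.zeros(n), np.zeros(p), -e, 0.0)
        add(-d, np.zeros(n), np.zeros(p), -e, 0.0)
    bounds = [(None, None)] * p + [(0, None)] * (n + p + p - 1)
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), bounds=bounds, method="highs")
    return res.x[:p]                        # the weight vector beta

# Tiny synthetic check.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 15))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=20))
beta = fused_svm_lp(X, y)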

22 Example: predicting metastasis in melanoma
[Figure: learned classifier weights (Weight) plotted along the genome (BAC position), two panels.]

23 Extension: joint segmentation of many profiles

24 Fused group lasso signal approximator
min_{β ∈ R^{n×p}} ‖Y - β‖^2 + λ Σ_{i=1}^{p-1} ‖β_{·,i+1} - β_{·,i}‖
[Figure: panels (a), (b), (c) showing log-ratio and score profiles along probe position.]
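
A sketch of this joint segmentation (my own construction, not the authors' code), with profiles stored as columns of Y so that all profiles share breakpoints: the group total-variation term is handled by projected gradient on the dual, projecting each row of the dual variable onto an ℓ2 ball. The data and λ are toy choices.

import numpy as np

def group_fused_lasso(Y, lam, n_iter=2000):
    # min_B 0.5*||Y - B||_F^2 + lam * sum_i ||B[i+1, :] - B[i, :]||_2,
    # solved in the dual with one row of U per candidate breakpoint.
    p, n = Y.shape
    D = np.diff(np.eye(p), axis=0)                # (p-1) x p first differences
    U = np.zeros((p - 1, n))
    for _ in range(n_iter):
        U -= 0.25 * (D @ (D.T @ U - Y))           # gradient step, 0.25 <= 1 / ||D D^T||
        norms = np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
        U *= np.minimum(1.0, lam / norms)         # project each row onto the l2 ball of radius lam
    return Y - D.T @ U                            # jointly segmented, piecewise-constant profiles

# Five noisy profiles sharing the same breakpoints.
rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(40), np.ones(30), np.zeros(30)])
Y = signal[:, None] + 0.3 * rng.normal(size=(100, 5))
B_hat = group_fused_lasso(Y, lam=3.0)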

25 Outline
1 Cancer prognosis from DNA copy number variations
2 Diagnosis and prognosis from gene expression data
3 Conclusion

26 Molecular diagnosis / prognosis / theragnosis

27 Gene selection, signature
The idea
We look for a limited set of genes that are sufficient for prediction; equivalently, the linear classifier will be sparse.
Why?
Bet on sparsity: we believe the "true" model is sparse.
Interpretation: we will get a biological interpretation more easily by looking at the selected genes.
Statistics: this is one way to constrain the solution and reduce the complexity to allow learning.

28 But...
Challenging the idea of a gene signature
We often observe little stability in the genes selected...
Is gene selection the most biologically relevant hypothesis?
What about thinking instead in terms of "pathway" or "module" signatures?

29 Gene networks
[Figure: a KEGG-derived gene network whose modules include N-glycan biosynthesis, glycolysis / gluconeogenesis, porphyrin and chlorophyll metabolism, riboflavin metabolism, sulfur metabolism, protein kinases, nitrogen and asparagine metabolism, folate biosynthesis, biosynthesis of steroids and ergosterol metabolism, DNA and RNA polymerase subunits, lysine biosynthesis, phenylalanine / tyrosine / tryptophan biosynthesis, oxidative phosphorylation and the TCA cycle, and purine metabolism.]

30 Graph-based penalty
Prior hypothesis
Genes near each other on the graph should have similar weights.
Two solutions (Rapaport et al., 2007, 2008):
Ω_spectral(β) = Σ_{i~j} (β_i - β_j)^2,
Ω_graphfusion(β) = Σ_{i~j} |β_i - β_j| + Σ_i |β_i|.
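
A small numpy illustration (mine; the edge list and weights are made up) of how these two penalties are computed from a gene network: the spectral penalty is the Laplacian quadratic form, while the graph-fusion penalty replaces squares by absolute values and adds a plain ℓ1 term.

import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (2, 4)]   # toy gene network on 5 genes
p = 5
beta = np.array([1.0, 0.9, 1.1, 0.0, -0.5])

A = np.zeros((p, p))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A                     # graph Laplacian

# Spectral penalty: beta^T L beta = sum over edges of (beta_i - beta_j)^2.
omega_spectral = float(beta @ L @ beta)
# Graph-fusion penalty: l1 differences along edges plus an l1 term on beta itself.
omega_graphfusion = sum(abs(beta[i] - beta[j]) for i, j in edges) + np.abs(beta).sum()
print(omega_spectral, omega_graphfusion)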

32 Classifiers (Rapaport et al.)
Fig. 4. Global connection map of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM.
[Figure: the KEGG global pathway map shown on the previous network slide, colored by the SVM weights.]

33 Classifier
Fig. 5 (from "Spectral analysis of gene expression profiles using gene networks"): the glycolysis/gluconeogenesis pathways of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM (a) and using high-frequency eigenvalue attenuation (b). The pathways are mutually exclusive in a cell, as clearly highlighted by...

34 Selecting pre-defined groups of variables
Group lasso (Yuan & Lin, 2006)
If groups of covariates are likely to be selected together, the ℓ1/ℓ2-norm induces sparse solutions at the group level:
Ω_group(w) = Σ_g ‖w_g‖_2
Ω(w_1, w_2, w_3) = ‖(w_1, w_2)‖_2 + ‖w_3‖_2
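
A minimal sketch (my illustration, with made-up groups) of why this norm selects variables by blocks: its proximal operator shrinks each group vector as a whole and zeroes a group entirely when its norm falls below the threshold.

import numpy as np

def prox_group_lasso(w, groups, t):
    # Proximal operator of t * sum_g ||w_g||_2, where the groups form a partition:
    # each group vector is shrunk as a block and vanishes entirely if its norm is <= t.
    w = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        w[g] = 0.0 if norm <= t else (1.0 - t / norm) * w[g]
    return w

w = np.array([0.3, -0.2, 2.0, 1.5, 0.1])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(prox_group_lasso(w, groups, t=0.5))   # the first and last groups are zeroed out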

35 Graph lasso
Hypothesis: selected genes should form connected components on the graph.
Two solutions (Jacob et al., 2009):
Ω_group(β) = Σ_{i~j} √(β_i^2 + β_j^2),
Ω_overlap(β) = sup { α^T β : α ∈ R^p, α_i^2 + α_j^2 ≤ 1 for all i~j }.
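
Ω_overlap can be handled as an ordinary group lasso in an expanded space where each gene is duplicated once per group (here, per edge) containing it, and the latent copies are summed back afterwards; this is the decomposition used by Jacob et al. (2009). A sketch of that bookkeeping (my code; the tiny network and the random v_demo standing in for an actual group-lasso fit are illustrative only):

import numpy as np

edges = [(0, 1), (1, 2), (2, 3)]                  # groups = edges of the gene network
p = 4

def expand_design(X, groups):
    # Give each group its own private copy of its variables (column duplication).
    cols = np.concatenate([np.asarray(g) for g in groups])
    expanded_groups, start = [], 0
    for g in groups:
        expanded_groups.append(np.arange(start, start + len(g)))
        start += len(g)
    return X[:, cols], expanded_groups, cols

def fold_back(v, cols, p):
    # beta_j is the sum of all latent copies of variable j.
    beta = np.zeros(p)
    np.add.at(beta, cols, v)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(30, p))
X_exp, exp_groups, cols = expand_design(X, edges)   # 30 x 6 design, 3 disjoint groups
v_demo = rng.normal(size=X_exp.shape[1])            # stand-in for a group-lasso solution on X_exp
beta_demo = fold_back(v_demo, cols, p)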

37 Overlap and group unit balls
Balls for Ω^G_group(·) (middle) and Ω^G_overlap(·) (right) for the groups G = {{1, 2}, {2, 3}}, where w_2 is represented as the vertical coordinate.

38 Summary: graph lasso vs kernel
Graph lasso:
Ω_graphlasso(w) = Σ_{i~j} √(w_i^2 + w_j^2)
constrains the sparsity, not the values.
Graph kernel:
Ω_graphkernel(w) = Σ_{i~j} (w_i - w_j)^2
constrains the values (smoothness), not the sparsity.
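
To see the contrast in practice, here is a short sketch of the graph-kernel side (mine, with a toy chain graph and arbitrary λ): the quadratic penalty gives a closed-form ridge-type solution that smooths the weights along the network but leaves essentially none of them exactly zero, whereas the graph lasso of the previous slides zeroes whole regions of the graph.

import numpy as np

def laplacian(edges, p):
    A = np.zeros((p, p))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def network_smoothed_regression(X, y, edges, lam=1.0, eps=1e-3):
    # Least squares with the smoothness penalty lam * w^T L w = lam * sum_{i~j} (w_i - w_j)^2;
    # eps * I keeps the system invertible since L is singular.
    p = X.shape[1]
    L = laplacian(edges, p)
    return np.linalg.solve(X.T @ X + lam * L + eps * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
edges = [(i, i + 1) for i in range(9)]            # a chain of 10 genes
X = rng.normal(size=(40, 10))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=40)
w = network_smoothed_regression(X, y, edges, lam=5.0)
print(np.round(w, 2))                             # weights vary smoothly along the chain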

39 Preliminary results
Breast cancer data: gene expression data for more than 8,000 genes in 295 breast cancer tumors.
Canonical pathways from MSigDB containing 639 groups of genes, 637 of which involve genes from our study.
Graph on the genes.

METHOD          ℓ1             Ω^G_overlap(·)
ERROR           0.38 ± 0.04    0.36 ± 0.03
MEAN PATH.      3              3

METHOD          ℓ1             Ω_graph(·)
ERROR           0.39 ± 0.04    0.36 ± 0.01
AV. SIZE C.C.   1.3            1.3

40 Lasso signature

41 Graph Lasso signature

42 Outline
1 Cancer prognosis from DNA copy number variations
2 Diagnosis and prognosis from gene expression data
3 Conclusion

43 Conclusion
Modern machine learning methods for regression / classification lend themselves well to the integration of prior knowledge in the penalization / regularization function.
Several computationally efficient approaches (structured LASSO, kernels, ...).
Tight collaborations with domain experts can help develop specific learning machines for specific data.
Natural extensions for data integration.

44 People I need to thank
Franck Rapaport (now MSKCC), Emmanuel Barillot, Andrei Zinovyev
Kevin Bleakley, Anne-Claire Haury (Institut Curie / ParisTech)
Laurent Jacob (UC Berkeley)
Guillaume Obozinski (INRIA)