Machine Learning for Toxicogenetics and Drug Response Prediction


1 Machine Learning for Toxicogenetics and Drug Response Prediction. Jean-Philippe Vert, Festival of Genomics, San Mateo, Nov 6, 2015

2 Molecular stratification: diagnosis, prognosis, drug response prediction, ...

3 Machine learning formulation: n (= 19) >> p (= 2): easy

7 Challenge: n << p. n = (patients), p = (genes, mutations, copy number, ...). Accuracy drops.

8 Outline 1 Learning molecular signatures with network information 2 Multitask learning for toxicogenetics

10 Feature selection (a.k.a. molecular signature)

11 Example: Breast cancer prognostic signature

12 But: 70 genes (Nature, 2002) vs. 76 genes (Lancet, 2005), only 3 genes in common

13 3 genes is the best you can expect given n and p
Reference: Haury A-C, Gestraud P, Vert J-P (2011) The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures. PLoS ONE 6(12). The study compares 32 feature selection methods on 4 public breast cancer gene expression datasets in terms of predictive performance, stability and functional interpretability of the signatures they produce. The choice of method strongly influences all three criteria; complex wrapper and embedded methods generally do not outperform simple univariate filters, ensemble feature selection brings no clear benefit, and a simple Student's t-test gives the best results overall.
[Figure: signature stability and AUC for Random, t-test, Entropy, Bhattacharyya, Wilcoxon, RFE, GFS, Lasso and Elastic Net selection, for single-run and ensemble variants]
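The stability issue above can be quantified directly. Below is a minimal, illustrative sketch (not the protocol of Haury et al.) that selects a t-test signature on bootstrap resamples and scores stability as the average pairwise Jaccard overlap of the selected gene sets; the function names, signature size and synthetic data are assumptions for the example.

```python
import numpy as np
from scipy.stats import ttest_ind

def ttest_signature(X, y, k=100):
    """Select the k genes with the smallest two-sample t-test p-values."""
    pvals = ttest_ind(X[y == 1], X[y == 0], axis=0).pvalue
    return set(np.argsort(pvals)[:k])

def signature_stability(X, y, k=100, n_resamples=20, seed=0):
    """Average pairwise Jaccard overlap of signatures selected on bootstrap samples."""
    rng = np.random.default_rng(seed)
    signatures = []
    for _ in range(n_resamples):
        idx = rng.choice(len(y), size=len(y), replace=True)
        signatures.append(ttest_signature(X[idx], y[idx], k))
    overlaps = [len(a & b) / len(a | b)
                for i, a in enumerate(signatures) for b in signatures[i + 1:]]
    return float(np.mean(overlaps))

# illustrative run on synthetic "expression" data: 100 samples x 2000 genes
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)
print(signature_stability(X, y))   # low overlap on pure noise: the signature is unstable
```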

14 Gene networks as prior knowledge
[Figure: gene network organized into functional modules, e.g. N-glycan biosynthesis, glycolysis/gluconeogenesis, oxidative phosphorylation/TCA cycle, purine metabolism, protein kinases, DNA and RNA polymerase subunits, amino acid metabolism]
Can we force the signatures to be "coherent" with a known gene network?

15 Network-driven structured feature selection (Jacob et al., 2009)
1. Using the network, define a non-smooth and convex set of "candidate" signatures compatible with it, through the penalty $\Omega(\beta) = \sup \big\{ \alpha^\top \beta \;:\; \alpha \in \mathbb{R}^p,\ \forall i \sim j,\ \alpha_i^2 + \alpha_j^2 \le 1 \big\}$
2. Among the candidates, find the best signature that explains the data (efficient optimization through convex programming).
(A sketch of this construction follows below.)
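A minimal numpy sketch of the idea behind this penalty, i.e. the overlapping ("latent") group lasso with one group per network edge from Jacob et al. (2009), assuming a squared loss and a plain ISTA solver; the function name, toy chain graph and step-size choice are illustrative, not the authors' implementation.

```python
import numpy as np

def graph_lasso_signature(X, y, edges, lam=0.1, n_iter=500):
    """Overlapping ('latent') group lasso over the edges of a gene network,
    via variable duplication: each edge (i, j) gets its own copy of the two
    variables, a standard group lasso is run on the duplicated design, and
    the copies are summed back into a single coefficient vector beta.
    Selected genes then come as unions of edges, i.e. connected in the network."""
    n, p = X.shape
    groups = [np.asarray(e) for e in edges]
    Xd = np.hstack([X[:, g] for g in groups])        # duplicated design, shape (n, 2*|E|)
    w = np.zeros(Xd.shape[1])
    step = 1.0 / (np.linalg.norm(Xd, 2) ** 2)        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):                          # plain ISTA iterations
        w = w - step * (Xd.T @ (Xd @ w - y))         # gradient step on the squared loss
        for k in range(len(groups)):                 # proximal step: block soft-thresholding
            blk = slice(2 * k, 2 * k + 2)
            norm = np.linalg.norm(w[blk])
            w[blk] = 0.0 if norm <= step * lam else (1 - step * lam / norm) * w[blk]
    beta = np.zeros(p)
    for k, g in enumerate(groups):                   # fold duplicated copies back together
        beta[g] += w[2 * k:2 * k + 2]
    return beta

# toy usage: 100 samples, 50 genes arranged on a chain graph
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=100)
edges = [(i, i + 1) for i in range(49)]
beta = graph_lasso_signature(X, y, edges, lam=5.0)
print(np.nonzero(beta)[0])                           # indices of selected genes
```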

16 Lasso signature (accuracy 0.61) Breast cancer prognosis

17 Graph Lasso signature (accuracy 0.64) Breast cancer prognosis

18 Outline 1 Learning molecular signatures with network information 2 Multitask learning for toxicogenetics

19 Toxicogenetics: genotypes from the 1000 Genomes Project, RNA-seq from the Geuvadis project

20 Again, n << p: n = 5e4, p = 1e10

21 Crowd-sourcing: the DREAM8 Toxicogenetics challenge

22 Bilinear regression. Cell line X, chemical Y, toxicity Z. Bilinear regression model:
$Z = f(X, Y) + b(Y) + \epsilon$
Estimation by kernel ridge regression:
$\min_{f \in \mathcal{H},\, b \in \mathbb{R}^m} \; \sum_{i=1}^{n} \sum_{j=1}^{m} \big( f(x_i, y_j) + b_j - z_{ij} \big)^2 + \lambda \|f\|_{\mathcal{H}}^2$
Solved in $O(\max(n, p)^3)$.
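A minimal numpy sketch of this kernel bilinear ridge regression, assuming precomputed cell-line and drug kernels and a fully observed toxicity matrix; the per-drug bias b(Y) is handled crudely by centering each column of Z, and all names are illustrative. The sketch exploits the product structure via eigendecompositions of the two small kernel matrices instead of forming the large nm x nm kernel.

```python
import numpy as np

def kron_kernel_ridge(K_cell, K_drug, Z, lam):
    """Kernel ridge regression with the product kernel of K_cell (n x n) and
    K_drug (m x m) on the full toxicity grid Z (n x m). Returns dual
    coefficients A (n x m); fitted values on the grid are K_cell @ A @ K_drug."""
    s_c, U_c = np.linalg.eigh(K_cell)      # K_cell = U_c diag(s_c) U_c^T
    s_d, U_d = np.linalg.eigh(K_drug)      # K_drug = U_d diag(s_d) U_d^T
    Zt = U_c.T @ Z @ U_d                   # rotate the responses into the joint eigenbasis
    At = Zt / (np.outer(s_c, s_d) + lam)   # shrink: Kronecker eigenvalues are s_c[i]*s_d[j]
    return U_c @ At @ U_d.T                # rotate back to dual coefficients

# toy usage with random positive semidefinite kernels (illustrative only)
rng = np.random.default_rng(0)
n, m = 50, 20
Xc, Xd = rng.normal(size=(n, 5)), rng.normal(size=(m, 3))
K_cell, K_drug = Xc @ Xc.T, Xd @ Xd.T
Z = rng.normal(size=(n, m))
A = kron_kernel_ridge(K_cell, K_drug, Z - Z.mean(axis=0), lam=1.0)
Z_hat = K_cell @ A @ K_drug + Z.mean(axis=0)   # fitted toxicities on the training grid
```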

23 Kernel Trick [diagram: cell line descriptors and drug descriptors]

24 Kernel Trick [diagram: cell line descriptors kernelized into Kcell; drug descriptors kernelized into Kdrug]

25 Kernel Trick [diagram: Kcell and Kdrug fed into kernel bilinear regression to produce the estimate f̂]

26 Kernel Trick [same diagram] Kernel choice? Choice of descriptors, data integration, missing data.

27 Kernel choice. 1. Kcell: 29 cell line kernels tested; one kernel integrating all information; deals with missing data. 2. Kdrug: 48 drug kernels tested; multi-task kernels.

29 Cell line data integration. Covariates: linear kernel. SNPs: 10 Gaussian kernels. RNA-seq: 10 Gaussian kernels.

30 Cell line data integration. Covariates: linear kernel, SNPs: 10 Gaussian kernels, RNA-seq: 10 Gaussian kernels; all combined into one integrated kernel. (An illustrative sketch of this integration follows below.)
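A minimal sketch of this kind of multiple-kernel data integration, assuming the simplest recipe: a linear kernel on covariates, ten Gaussian kernels per genomic data type at bandwidths spread around the median-distance heuristic, each kernel centered and trace-normalized, and the integrated kernel taken as their mean. The bandwidth grid, normalization and variable names are assumptions, not the exact challenge pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_kernels(X, n_widths=10):
    """Gaussian kernels at several bandwidths around the median pairwise distance."""
    D2 = squareform(pdist(X, "sqeuclidean"))
    sigma2 = np.median(D2[D2 > 0])
    scales = np.logspace(-2, 2, n_widths)        # illustrative bandwidth grid
    return [np.exp(-D2 / (2.0 * sigma2 * s)) for s in scales]

def center_and_scale(K):
    """Center a kernel in feature space and normalize to unit trace so that
    heterogeneous kernels can be averaged on a comparable scale."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    return Kc / np.trace(Kc)

# illustrative data: covariates, SNPs and RNA-seq profiles for n cell lines
rng = np.random.default_rng(0)
n = 100
covariates = rng.normal(size=(n, 4))
snps = rng.integers(0, 3, size=(n, 500)).astype(float)
rnaseq = rng.normal(size=(n, 2000))

kernels = [covariates @ covariates.T]            # linear kernel on covariates
kernels += rbf_kernels(snps)                     # 10 Gaussian kernels on SNPs
kernels += rbf_kernels(rnaseq)                   # 10 Gaussian kernels on RNA-seq
K_int = np.mean([center_and_scale(K) for K in kernels], axis=0)   # integrated kernel
```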

31 Multi-task drug kernels: 1. Dirac, 2. Multi-Task, 3. Feature-based, 4. Empirical, 5. Integrated. Dirac kernel: independent regression for each drug.

32 Multi-task drug kernels: 1. Dirac, 2. Multi-Task, 3. Feature-based, 4. Empirical, 5. Integrated. Multi-task kernel: sharing information across drugs.

33 Multi-task drug kernels: 1. Dirac, 2. Multi-Task, 3. Feature-based, 4. Empirical, 5. Integrated. Feature-based kernels: linear kernel and 10 Gaussian kernels on CDK features (160 descriptors) and SIRMS features (9272 descriptors); graph kernel for molecules (2D walk kernel); fingerprint of 2D substructures (881 descriptors); ability to bind human proteins (1554 descriptors).

34 Multi-task drug kernels: 1. Dirac, 2. Multi-Task, 3. Feature-based, 4. Empirical, 5. Integrated. Empirical kernel: [heatmap of empirical correlation of toxicity between drugs].

35 Multi-task drug kernels: 1. Dirac, 2. Multi-Task, 3. Feature-based, 4. Empirical, 5. Integrated. Integrated kernel Kint = Σ_i K_i: combine all information on drugs. (A minimal sketch of these constructions follows below.)
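A minimal sketch of how such drug-side kernels can be built: a Dirac kernel (independent regression per drug), a uniform multi-task kernel that shares information across drugs, an empirical kernel from the correlation of observed toxicities (projected back to positive semidefinite), and an integrated kernel as their sum. The exact kernels used in the challenge may differ; the names and the alpha parameter are illustrative.

```python
import numpy as np

def dirac_kernel(m):
    """Dirac kernel: no sharing, equivalent to one independent regression per drug."""
    return np.eye(m)

def multitask_kernel(m, alpha=0.5):
    """Uniform multi-task kernel: interpolates between a single shared model
    (alpha = 1) and fully independent models (alpha = 0)."""
    return (1 - alpha) * np.eye(m) + alpha * np.ones((m, m)) / m

def empirical_kernel(Z_train):
    """Kernel from the empirical correlation of observed toxicities across drugs
    (columns of Z_train), with negative eigenvalues clipped to keep it PSD."""
    C = np.corrcoef(Z_train, rowvar=False)
    w, V = np.linalg.eigh(C)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

# integrated drug kernel: sum of the available drug kernels (Kint = sum_i K_i)
m = 30
Z_train = np.random.default_rng(0).normal(size=(100, m))
K_drugs = [dirac_kernel(m), multitask_kernel(m), empirical_kernel(Z_train)]
K_int = sum(K_drugs)
```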

36 29x48 kernel combinations: CV results [heatmap of cross-validated CI for every cell line kernel x drug kernel pair]

37 29x48 kernel combinations: CV results [same heatmap, highlighting the integrated and covariates kernels]

38 29x48 kernel combinations: CV results [same heatmap] Best combinations: slightly multi-task kernels on drugs, covariates kernel on cell lines.
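CI in these plots denotes a concordance index. Below is a minimal, generic sketch of such a score (not the challenge's exact probabilistic c-index), of the kind used to rank each of the 29 x 48 kernel pairs by cross-validated concordance between predicted and observed toxicity; the toy data are illustrative.

```python
import numpy as np
from itertools import combinations

def concordance_index(y_true, y_pred):
    """Fraction of pairs with distinct true toxicities whose predicted values
    are ordered the same way as the true values."""
    concordant, comparable = 0, 0
    for i, j in combinations(range(len(y_true)), 2):
        if y_true[i] != y_true[j]:
            comparable += 1
            concordant += ((y_pred[i] - y_pred[j]) * (y_true[i] - y_true[j])) > 0
    return concordant / comparable

# illustrative check: a noisy monotone prediction scores well above 0.5
rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.5, size=200)
print(concordance_index(y_true, y_pred))
```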

39 Kernel on cell lines: CV results [bar plot of mean CI for each cell line kernel; annotations highlight the integrated kernel at the top and a visible batch effect]

40 Kernel on drugs: CV results [bar plot of mean CI for each drug kernel]

41 Final Submission (ranked 2nd): empirical kernel on drugs, integrated kernel on cell lines.

42 Conclusion. Small n, large p: regularized models with prior knowledge. Heterogeneous data integration: kernel methods. Performance often remains disappointing! Progress comes in small steps.

43 Thanks