Machine Learning for Toxicogenetics and Drug Response Prediction

Machine Learning for Toxicogenetics and Drug Response Prediction. Jean-Philippe Vert, Festival of Genomics, San Mateo, Nov 6, 2015.

Molecular stratification: diagnosis, prognosis, drug response prediction, ...

Machine learning formulation: n (= 19) >> p (= 2): easy

Challenge: n << p. n = 10^2 to 10^4 (patients), p = 10^4 to 10^7 (genes, mutations, copy number, ...). Accuracy drops.

Outline 1 Learning molecular signatures with network information 2 Multitask learning for toxicogenetics

Feature selection (a.k.a. molecular signature)

Example: Breast cancer prognostic signature

But... 70 genes (Nature, 2002) vs. 76 genes (Lancet, 2005): 3 genes in common.

3 genes is the best you can expect given n and p. Haury A-C, Gestraud P, Vert J-P (2011) The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures. PLoS ONE 6(12): e28210. doi:10.1371/journal.pone.0028210.

The study compares 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of the predictive performance, stability and functional interpretability of the signatures they produce. The feature selection method has a significant influence on all three criteria. Surprisingly, complex wrapper and embedded methods generally do not outperform simple univariate feature selection, and ensemble feature selection has generally no positive effect; overall a simple Student's t-test seems to provide the best results.

[Figure: signature stability (0 to 0.2) and AUC (0.56 to 0.66) for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, RFE, GFS, Lasso and Elastic Net selection, comparing single-run and ensemble variants.]
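The small overlap between published signatures is easy to reproduce in a toy experiment: select a t-test signature (the method the paper found hardest to beat) on two disjoint halves of a cohort and count the shared genes. The sketch below assumes a gene expression matrix X (samples × genes) and a binary outcome y; it is only an illustration of the instability, not the paper's evaluation protocol.

```python
import numpy as np
from scipy import stats

def ttest_signature(X, y, k=70):
    """Rank genes by two-sample t-test p-value and keep the top k."""
    _, pvals = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
    return set(np.argsort(pvals)[:k])

def signature_overlap(X, y, k=70, seed=0):
    """Select a signature on two disjoint halves of the cohort and
    count how many genes the two signatures share."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    s1 = ttest_signature(X[idx[:half]], y[idx[:half]], k)
    s2 = ttest_signature(X[idx[half:]], y[idx[half:]], k)
    return len(s1 & s2)
```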

Gene networks as prior knowledge. [Figure: a gene network coloured by pathway, e.g. N-glycan biosynthesis, glycolysis/gluconeogenesis, oxidative phosphorylation/TCA cycle, purine metabolism, protein kinases, DNA and RNA polymerase subunits, ...] Can we force the signatures to be "coherent" with a known gene network?

Network-driven structured feature selection (Jacob et al., 2009). 1. Using the network, define a non-smooth and convex set of "candidate" signatures compatible with it, via the penalty $\Omega(\beta) = \sup\left\{ \alpha^\top \beta \;:\; \alpha \in \mathbb{R}^p,\ \forall i \sim j,\ \alpha_i^2 + \alpha_j^2 \le 1 \right\}$. 2. Among the candidates, find the best signature that explains the data (efficient optimization through convex programming).
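For intuition, the dual definition of Ω can be evaluated numerically with a generic convex solver. The sketch below assumes the cvxpy package and a toy edge list; it only illustrates the penalty, not the efficient optimization scheme of Jacob et al.

```python
import cvxpy as cp
import numpy as np

def graph_penalty(beta, edges):
    """Omega(beta) = sup { alpha . beta : alpha in R^p,
    alpha_i^2 + alpha_j^2 <= 1 for every edge (i, j) of the gene network }."""
    alpha = cp.Variable(len(beta))
    constraints = [cp.square(alpha[i]) + cp.square(alpha[j]) <= 1 for i, j in edges]
    problem = cp.Problem(cp.Maximize(cp.sum(cp.multiply(beta, alpha))), constraints)
    problem.solve()
    return problem.value

# toy network: a path over 4 genes (every gene touches at least one edge)
edges = [(0, 1), (1, 2), (2, 3)]
beta = np.array([1.0, 0.0, 0.5, -0.2])
print(graph_penalty(beta, edges))
```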

Lasso signature (accuracy 0.61) Breast cancer prognosis

Graph Lasso signature (accuracy 0.64) Breast cancer prognosis

Outline 1 Learning molecular signatures with network information 2 Multitask learning for toxicogenetics

Toxicogenetics: genotypes from the 1000 Genomes Project, RNA-Seq from the Geuvadis project.

Again, n << p: n = 5×10^4, p = 10^10.

Crowd-sourcing: the DREAM8 Toxicogenetics challenge

Bilinear regression. Cell line X, chemical Y, toxicity Z. Bilinear regression model: $Z = f(X, Y) + b(Y) + \epsilon$. Estimation by kernel ridge regression: $\min_{f \in \mathcal{H},\, b \in \mathbb{R}^m} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( f(x_i, y_j) + b_j - z_{ij} \right)^2 + \lambda \|f\|^2$, solved in $O(\max(n, p)^3)$.
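To see where the cubic cost comes from, note that the product kernel K_cell ⊗ K_drug never has to be formed explicitly: two eigendecompositions of the individual Gram matrices suffice. The sketch below is an illustrative implementation of that shortcut for a plain product kernel and squared loss; it omits the per-drug bias b_j and is not the actual challenge code.

```python
import numpy as np

def kron_kernel_ridge(K_cell, K_drug, Z, lam):
    """Kernel ridge regression with the product kernel K_cell (x) K_drug.
    K_cell: n x n Gram matrix over cell lines, K_drug: m x m over drugs,
    Z: n x m matrix of toxicity values, lam: ridge parameter."""
    dc, Uc = np.linalg.eigh(K_cell)      # eigendecomposition of the cell line kernel
    dd, Ud = np.linalg.eigh(K_drug)      # eigendecomposition of the drug kernel
    Zt = Uc.T @ Z @ Ud                   # response expressed in the joint eigenbasis
    D = np.outer(dc, dd)                 # eigenvalues of the Kronecker product
    A = Uc @ (Zt / (D + lam)) @ Ud.T     # dual coefficients, one per (cell line, drug) pair
    return A

def predict(K_cell_test, A, K_drug_test):
    """Predict toxicity for new pairs, given test-vs-train kernel blocks."""
    return K_cell_test @ A @ K_drug_test.T
```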

Kernel trick. [Diagram: cell line descriptors are kernelized into K_cell and drug descriptors into K_drug; the two kernels feed a kernel bilinear regression that estimates f̂.] Kernel choice raises several questions: which descriptors, how to integrate heterogeneous data, how to handle missing data.

Kernel choice. 1. K_cell: 29 cell line kernels tested; one kernel that integrates all information; deals with missing data. 2. K_drug: 48 drug kernels tested; multi-task kernels.

Cell line data integration. Covariates: linear kernel. SNPs: 10 Gaussian kernels. RNA-seq: 10 Gaussian kernels. All combined into an integrated kernel (see the sketch below).
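One plausible reading of "10 Gaussian kernels" per data type is a grid of RBF bandwidths, later summed into the integrated kernel. The sketch below illustrates that construction; the actual bandwidth grid and normalization used in the challenge are not given here, so the median heuristic and trace normalization are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_kernels(X, n_bandwidths=10):
    """Gram matrices of Gaussian (RBF) kernels over a grid of bandwidths.
    X: samples x features array (e.g. RNA-seq expression or SNP dosages)."""
    D2 = squareform(pdist(X, "sqeuclidean"))              # pairwise squared distances
    scale = np.median(D2[D2 > 0])                          # median heuristic as reference scale
    bandwidths = scale * np.logspace(-2, 2, n_bandwidths)  # grid around the median scale
    return [np.exp(-D2 / (2.0 * b)) for b in bandwidths]

def integrated_kernel(kernels):
    """Integrated kernel: sum of trace-normalized Gram matrices."""
    n = kernels[0].shape[0]
    return sum(K * n / np.trace(K) for K in kernels)
```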

Multi-task drug kernels:
1. Dirac: independent regression for each drug.
2. Multi-Task: sharing information across drugs.
3. Feature-based: a linear kernel and 10 Gaussian kernels built on drug features: CDK descriptors (160), SIRMS descriptors (9272), a graph kernel for molecules (2D walk kernel), fingerprints of 2D substructures (881 descriptors), and ability to bind human proteins (1554 descriptors).
4. Empirical: based on the empirical correlation between drug toxicity profiles. [Heatmap of pairwise empirical correlations, values 0 to 0.8.]
5. Integrated: $K_{int} = \sum_i K_i$, combining all information on drugs.
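To make the list concrete, here is a minimal sketch of how the Dirac, a simple multi-task, and the empirical kernel on drugs could be built. The mixing weight gamma and the correlation-based construction are illustrative assumptions, not the kernels actually submitted.

```python
import numpy as np

def dirac_kernel(n_drugs):
    """Dirac kernel: no sharing, equivalent to an independent regression per drug."""
    return np.eye(n_drugs)

def multitask_kernel(K_features, gamma=0.5):
    """Simple multi-task kernel: interpolate between task independence
    (identity, gamma = 0) and drug similarity (K_features, gamma = 1)."""
    n = K_features.shape[0]
    return (1.0 - gamma) * np.eye(n) + gamma * K_features

def empirical_kernel(Z_train):
    """Empirical kernel: correlation between the observed toxicity profiles of
    drugs across training cell lines (Z_train is cell lines x drugs)."""
    C = np.corrcoef(Z_train, rowvar=False)
    return np.nan_to_num(C)   # guard against constant toxicity columns
```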

29×48 kernel combinations: CV results. [Heatmap of cross-validated concordance index (CI, roughly 0.50 to 0.53) over all cell line kernel × drug kernel pairs.] The best combinations use the integrated and covariates kernels on the cell line side, and slightly multi-task kernels on the drug side.

Kernel on cell lines: CV results. [Bar plot of mean CI per cell line kernel, values roughly 0.50 to 0.54; the integrated kernel ranks first, and the covariate kernels, notably batch, also score highly (batch effect).]

Kernel on drugs: CV results. [Bar plot of mean CI per chemical kernel, values roughly 0.50 to 0.54.]

Final submission (ranked 2nd): the empirical correlation kernel on drugs combined with the integrated kernel on cell lines. [Heatmap of empirical correlations between drugs; mean CI per chemical kernel roughly 0.50 to 0.54.]

Conclusion. Small n, large p: regularized models with prior knowledge. Heterogeneous data integration: kernel methods. Performance often remains disappointing! Progress arises by small steps.

Thanks! cbio@mines-paristech.fr, u900@curie.fr, sandrine@stat.berkeley.edu