In silico prediction of novel therapeutic targets using gene disease association data
PhD, Associate GSK Fellow, Scientific Leader, Computational Biology and Stats, Target Sciences, GSK
Big Data in Medicine, 04.07.2017
Challenges in pharma R&D
Time and costs are increasing but the success rate is declining 2
Why focus on targets?
Late-phase failures cost (a lot) more
[Figure: number of molecules and relative cost per molecule at each stage: Lead discovery, Lead optimization, Pre-clinical, FTIH, Phase 2, Phase 3]
Manhattan Institute, 2012 3
Rethink the drug discovery pipeline
Spend more time and resources on target validation to reduce attrition in later phases
[Figure: two versions of the pipeline (Potential targets, Target validation, Lead discovery, Lead optimisation, Pre-clinical, FTIH, Phase 2, Phase 3, Launch), contrasting current and proposed allocation of effort] 4
Target discovery and genetic evidence
Cook et al., 2014; Nelson et al., 2015
40% of efficacy failures are due to poor linkage between target and disease. The proportion of drug mechanisms with direct genetic support increases significantly across the drug development pipeline. Selecting genetically supported targets could double the success rate in clinical development. 5
Open Targets A platform for therapeutic target identification and validation 6
Predicting therapeutic targets
Could it be as easy as spotting spam emails? Is it possible to predict novel therapeutic targets using available gene disease association data? 7
A simple machine learning workflow
Predict therapeutic targets using only gene disease association data:
1. Generate input data matrix
2. Exploratory data analysis
3. Assign labels and split into training, test and prediction sets
4. Tune, train and test classifiers using nested cross-validation
5. Evaluate best classifier performance on test set
6. Make predictions using best performing classifier
7. Explore predicted targets across the drug discovery pipeline
8. Validate with literature text mining 8
Data sources and data processing
Input data matrix generation
Obtain all gene disease associations and supporting evidence from the Open Targets platform. For each gene, create one numeric feature per evidence type by taking the mean score across all diseases:
- Genetic associations (germline)
- Somatic mutations
- Significant gene expression changes
- Disease-relevant phenotype in animal model
- Pathway-level evidence
Gather positive labels from Pharmaprojects: only consider targets with drugs currently on the market, in clinical trials or in preclinical studies. Exclude targets with drugs withdrawn from the market or whose development has been discontinued. 9
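The feature construction above can be sketched with pandas. The table layout, column names and scores below are illustrative assumptions, not the actual Open Targets schema, and averaging only over diseases with evidence present is one possible reading of the slide:

```python
import pandas as pd

# Toy stand-in for Open Targets evidence: one row per
# (gene, disease, evidence type) with a score in [0, 1].
# Names and values are made up for illustration.
evidence = pd.DataFrame({
    "gene":      ["IL6", "IL6", "IL6", "TP53", "TP53"],
    "disease":   ["RA", "RA", "asthma", "cancer", "cancer"],
    "data_type": ["genetic_association", "pathway",
                  "genetic_association", "somatic_mutation",
                  "animal_model"],
    "score":     [0.9, 0.4, 0.7, 1.0, 0.6],
})

# One numeric feature per evidence type: mean score over the diseases
# where that evidence exists; missing combinations are filled with 0.
features = evidence.pivot_table(index="gene", columns="data_type",
                                values="score", aggfunc="mean",
                                fill_value=0)
```

The result is one row per gene and one column per evidence type, which is the input matrix shape the rest of the workflow assumes.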
A positive-unlabelled (PU) semi-supervised learning approach
Split data into training, test and prediction sets
A semi-supervised framework with only positive labels is used: targets according to Pharmaprojects constitute the positive class (P), while the rest of the proteome forms the unlabelled class (U), containing both negatives and yet-to-be-discovered positives. All positive cases (1421) and an equal number of randomly selected unlabelled cases (2842 in total) are set apart for training (80%) and testing (20%). The remainder is kept as a prediction set on which the final model will make predictions. 10
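A minimal sketch of this split, using the slide's 1421 positives and an assumed 15,000-gene unlabelled pool; integer index arrays stand in for real gene identifiers:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_pos = 1421          # Pharmaprojects targets (positive class P)
n_unlabelled = 15000  # assumed size of the rest of the proteome (U)
pos_idx = np.arange(n_pos)
unl_idx = np.arange(n_pos, n_pos + n_unlabelled)

# Sample as many unlabelled cases as there are positives (2842 total).
sampled_unl = rng.choice(unl_idx, size=n_pos, replace=False)
model_idx = np.concatenate([pos_idx, sampled_unl])
labels = np.array([1] * n_pos + [0] * n_pos)

# 80/20 stratified split into training and test sets.
train_idx, test_idx, y_train, y_test = train_test_split(
    model_idx, labels, test_size=0.2, stratify=labels, random_state=0)

# Everything not sampled stays in the prediction set.
pred_idx = np.setdiff1d(unl_idx, sampled_unl)
```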
Dimensionality reduction reveals structure in the data
t-distributed Stochastic Neighbour Embedding (t-SNE) 11
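As an illustration of this embedding step, a t-SNE projection with scikit-learn on synthetic data standing in for the gene-by-evidence matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two loose clusters stand in for the gene x evidence-type matrix.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 8)),
    rng.normal(3.0, 1.0, size=(100, 8)),
])

# Project to 2-D for visual inspection of structure in the data.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

Each row of `emb` is a 2-D coordinate for one gene, ready for a scatter plot coloured by label.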
What are the most important features? Chi-squared test + information gain 12
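Both feature scores can be computed with scikit-learn; `mutual_info_classif` plays the role of information gain here, and the data are synthetic:

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic data: one informative feature, one noise feature.
# chi2 requires non-negative inputs, hence the uniform offsets.
y = rng.integers(0, 2, 300)
informative = y + rng.uniform(0, 0.5, 300)
noise = rng.uniform(0, 1.5, 300)
X = np.column_stack([informative, noise])

chi2_scores, p_values = chi2(X, y)                     # chi-squared test
mi_scores = mutual_info_classif(X, y, random_state=0)  # ~ information gain
```

Ranking features by either score highlights the evidence types that best separate targets from non-targets.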
Nested cross-validation and bagging for tuning and model selection
Tuning, training and testing four classifiers
Four classifiers are independently tuned, trained and tested on the training set using a nested cross-validation strategy (4 inner rounds for parameter tuning and 4 outer rounds to assess performance):
- Random forest (tuned parameters: number of trees and number of features);
- Feed-forward neural network with a single hidden layer (tuned parameters: size and decay);
- Support vector machine with radial kernel (tuned parameters: gamma and cost);
- Gradient boosting machine with AdaBoost exponential loss function (tuned parameters: number of trees and interaction depth).
In PU learning, U contains both positive and negative cases, which results in classifier instability. Bagging (bootstrap aggregating) can improve the performance of unstable classifiers by randomly resampling P and U with replacement (bootstrap) and then aggregating the results by majority voting. Bagging with 100 iterations was applied to the neural network, the support vector machine and the gradient boosting machine; random forests are already a special case of bagging. 13
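The nested scheme can be sketched for one of the four models, the neural network, with scikit-learn. `MLPClassifier`'s `alpha` stands in for the slide's weight decay, the data are synthetic, and 10 bagging iterations replace the 100 used in the study to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the gene x evidence-type training matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Inner loop (4-fold): tune hidden layer size and weight decay
# (alpha is scikit-learn's analogue of the slide's "decay").
inner = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(5,), (10,)], "alpha": [1e-4, 1e-2]},
    cv=4,
)

# Outer loop (4-fold): estimate performance of the tuned model.
outer_scores = cross_val_score(inner, X, y, cv=4)

# Bagging: bootstrap-resample the training data and aggregate by
# majority vote; 10 iterations here instead of the study's 100.
bagged = BaggingClassifier(
    MLPClassifier(max_iter=500, random_state=0),
    n_estimators=10, random_state=0,
).fit(X, y)
```

Because tuning happens only inside the inner folds, `outer_scores` is an unbiased estimate of the tuned model's performance.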
Evaluating classifier performance
Receiver operating characteristic curves (AUC: 0.76) 14
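ROC curves and AUC as computed by scikit-learn; the labels and scores below are toy values, chosen so that the AUC happens to come out at 0.76:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy test-set labels and predicted target probabilities.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.7, 0.65, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under it
```

Plotting `tpr` against `fpr` reproduces the curve; the AUC summarises it as the probability that a random positive is scored above a random negative.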
Disease association evidence higher for more advanced targets Model predicts late-stage targets more easily than early-stage ones 15
Literature text mining validation of predictions Highly significant overlap between predictions and text mining results 16
Conclusions
In silico prediction of novel therapeutic targets using gene disease association data
- The gene disease association data in Open Targets contains enough information to predict whether a protein can make a therapeutic target, with decent accuracy (71%).
- Aside from standard cross-validation and testing, prediction results were also validated by mining the scientific literature for therapeutic targets and assessing the significance of the overlap.
- The ability of the neural network model to predict late-stage targets with greater accuracy confirms that a clear linkage between target and disease is essential to maximise the chances of success in the clinic.
- Of the evidence types tested, animal models showing disease-relevant phenotypes, dysregulated gene expression in disease tissue and genetic associations between gene and disease appear to be the most informative. 17
Acknowledgements
Ian Dunham, Philippe Sanseau, Gautier Koscielny, Giovanni Dall'Olio, Pankaj Agarwal, Mark Hurle, Steven Barrett, Nicola Richmond, Jin Yao 18
Thank you 19
Pharmaprojects An industry-wide drug development database 20
Exploratory data analysis reveals sparse data with little structure Hierarchical clustering + principal component analysis 21
Tune, train and test classifiers using cross-validation Decision tree classification criteria 22
Evaluating classifier performance
Performance measures for supervised learning 23
Neural network performance on independent test set
Selected as the classifier with the most balanced overall performance for further analyses

Metric                   Cross-validation   Test
Misclassification error  0.303              0.287
Accuracy                 0.697              0.713
AUC                      0.758              0.763
Recall/Sensitivity       0.610              0.638
Specificity              0.785              0.784
Precision                0.742              0.736
F1 score                 0.670              0.683  24
Tune, train and test classifiers using cross-validation Misclassification error 25
Evaluate best classifier performance on test set
Confusion matrices

Cross-validation (rows: prediction outcome, columns: actual value)
                   Unknown  Target
Predicted Unknown      912     217
Predicted Target       445     700

Test (rows: prediction outcome, columns: actual value)
                   Unknown  Target
Predicted Unknown      225      67
Predicted Target        99     177  26
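The summary measures reported for the neural network all derive from confusion-matrix counts like these; the counts below are illustrative, not the slide's numbers:

```python
# Confusion-matrix counts (illustrative values):
tp, fp, fn, tn = 80, 20, 30, 70

accuracy    = (tp + tn) / (tp + fp + fn + tn)
error       = 1 - accuracy                 # misclassification error
recall      = tp / (tp + fn)               # a.k.a. sensitivity
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
f1          = 2 * precision * recall / (precision + recall)
```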
Split into training, test and prediction sets
Assess the effect of randomly sampling from the unlabelled class: Monte Carlo simulation 27
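One way to run such a Monte Carlo check, sketched here with a cheap proxy instead of full retraining: redraw the unlabelled sample many times and measure how much the fraction of hidden positives varies between draws. The 5% hidden-positive rate is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_pos = 1421            # size of each unlabelled draw (matches P)
n_unlabelled = 15000    # assumed size of the unlabelled class
hidden_rate = 0.05      # assumed fraction of true targets hidden in U

# Mark which unlabelled genes are secretly positives (assumption).
hidden = rng.random(n_unlabelled) < hidden_rate

# Redraw the unlabelled sample 200 times and record the fraction of
# hidden positives contaminating each draw.
contamination = []
for _ in range(200):
    draw = rng.choice(n_unlabelled, size=n_pos, replace=False)
    contamination.append(hidden[draw].mean())
contamination = np.array(contamination)

spread = contamination.std()  # run-to-run variability from sampling alone
```

A small `spread` suggests the random choice of unlabelled "negatives" has a limited effect on any quantity estimated from a single draw.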
Tune, train and test classifiers using cross-validation
Precision-recall curves 28
Tune, train and test classifiers using cross-validation
Overlap between predictions on training set
[Figure: two panels, "Predicted targets" and "Predicted non-targets"] 29
Targets with lower disease association fail more often Majority of targets with discontinued programmes not predicted as targets 30
Generating predictions on remaining 15K genes Run model on prediction set (not used for training/testing) 31
Validate with literature text mining Assess the significance of the literature-based validation: permutation test 32
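A permutation-test sketch for this overlap significance; the gene identifiers, set sizes and injected overlap are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy genome, predicted-target set and text-mined-target set.
genome = np.arange(15000)
predicted = set(rng.choice(genome, 1500, replace=False))
textmined = set(rng.choice(genome, 800, replace=False))
textmined |= set(list(predicted)[:300])   # inject genuine overlap

observed = len(predicted & textmined)

# Null distribution: overlap with random gene sets of the same size.
null = np.array([
    len(set(rng.choice(genome, 1500, replace=False)) & textmined)
    for _ in range(1000)
])

# Add-one correction keeps the p-value strictly positive.
p_value = (np.sum(null >= observed) + 1) / (len(null) + 1)
```

A small `p_value` indicates the observed overlap is far larger than random gene sets of the same size would produce.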