In silico prediction of novel therapeutic targets using gene disease association data

PhD, Associate GSK Fellow
Scientific Leader, Computational Biology and Stats, Target Sciences, GSK
Big Data in Medicine, 04.07.2017

Challenges in pharma R&D
Time and costs are increasing but the success rate is declining.

Why focus on targets?
Late phase failures cost (a lot) more.
[Figure: number of molecules and relative cost per molecule at each stage: lead discovery, lead optimization, pre-clinical, FTIH, Phase 2, Phase 3. Source: Manhattan Institute, 2012]

Rethink the drug discovery pipeline
Spend more time and resources in target validation to reduce attrition in later phases.
[Figure: the pipeline drawn twice for comparison, with stages: potential targets, target validation, lead discovery, lead optimisation, pre-clinical, FTIH, Phase 2, Phase 3, launch]

Target discovery and genetics evidence
Cook et al., 2014; Nelson et al., 2015
- 40% of efficacy failures are due to poor linkage between target and disease.
- The proportion of drug mechanisms with direct genetic support increases significantly across the drug development pipeline.
- Selecting genetically supported targets could double the success rate in clinical development.

Open Targets
A platform for therapeutic target identification and validation

Could it be as easy as spotting spam emails?
Predicting therapeutic targets
Is it possible to predict novel therapeutic targets using available gene disease association data?

A simple machine learning workflow
Predict therapeutic targets only using gene disease association data
1. Generate input data matrix
2. Assign labels and split into training, test and prediction sets
3. Exploratory data analysis
4. Tune, train and test classifiers using nested cross-validation
5. Evaluate best classifier performance on test set
6. Explore predicted targets across the drug discovery pipeline
7. Make predictions using best performing classifier
8. Validate with literature text mining

Data sources and data processing
Input data matrix generation
Obtain all gene disease associations and supporting evidence from the Open Targets platform.
For all genes, create numeric features by taking the mean score across all diseases:
- Genetic associations (germline)
- Somatic mutations
- Significant gene expression changes
- Disease-relevant phenotype in animal model
- Pathway-level evidence
Gather positive labels from Pharmaprojects: only consider targets with drugs currently on the market, in clinical trials or in preclinical studies. Exclude targets with drugs withdrawn from the market or whose development has been discontinued.
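The feature construction described above (one mean score per gene and evidence type, averaged over diseases) can be sketched as follows. This is a minimal illustration, not the pipeline used in the talk: the record layout, the evidence-type keys, the gene names and the choice to fill missing evidence with 0.0 are all assumptions made for the example, and none of them reflects the Open Targets data schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical keys for the five evidence types listed on the slide.
EVIDENCE_TYPES = [
    "genetic_association",
    "somatic_mutation",
    "gene_expression",
    "animal_model",
    "pathway",
]

# Hypothetical records: (gene, disease, evidence_type, association score).
associations = [
    ("GENE_A", "asthma", "genetic_association", 1.0),
    ("GENE_A", "eczema", "genetic_association", 0.5),
    ("GENE_A", "asthma", "animal_model", 0.5),
    ("GENE_B", "melanoma", "somatic_mutation", 0.25),
]


def build_feature_matrix(records, missing=0.0):
    """Return {gene: [mean score per evidence type]}, averaged over diseases."""
    scores = defaultdict(lambda: defaultdict(list))
    for gene, _disease, evidence_type, score in records:
        scores[gene][evidence_type].append(score)
    matrix = {}
    for gene, by_type in scores.items():
        # Assumption: an evidence type with no record gets the `missing` value.
        matrix[gene] = [
            mean(by_type[t]) if t in by_type else missing for t in EVIDENCE_TYPES
        ]
    return matrix


features = build_feature_matrix(associations)
```

Each gene ends up as one row with five numeric columns, which is the shape a classifier expects.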

A positive unlabelled (PU) semi-supervised learning approach
Split data into training, test and prediction sets
A semi-supervised framework with only positive labels is used: targets according to Pharmaprojects constitute the positive class (P), while the rest of the proteome is used as the unlabelled class (U), containing both negatives and yet-to-be-discovered positives.
All positive cases (1421) and an equal number of randomly selected unlabelled cases (2842 in total) are set apart for training (80%) and testing (20%).
The remainder is kept as a prediction set, on which the final model makes its predictions.
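The split described above can be sketched in a few lines. The function name, the identifiers and the size of the unlabelled pool (15,000) are assumptions for illustration; only the 1421 positives and the 80/20 proportion come from the slide.

```python
import random


def pu_split(positives, unlabelled, train_fraction=0.8, seed=0):
    """Balanced PU split: all positives plus an equal-size random sample of U."""
    rng = random.Random(seed)
    sampled_u = rng.sample(sorted(unlabelled), len(positives))
    labelled = [(g, 1) for g in sorted(positives)] + [(g, 0) for g in sampled_u]
    rng.shuffle(labelled)
    cut = round(train_fraction * len(labelled))
    train, test = labelled[:cut], labelled[cut:]
    # Every unlabelled gene that was not sampled goes to the prediction set.
    sampled = set(sampled_u)
    prediction = [g for g in sorted(unlabelled) if g not in sampled]
    return train, test, prediction


# Toy identifiers; the unlabelled pool size is an illustrative assumption.
positives = {f"P{i}" for i in range(1421)}
unlabelled = {f"U{i}" for i in range(15000)}
train, test, prediction = pu_split(positives, unlabelled)
```

Note that label 0 here means "unlabelled", not "confirmed non-target", which is why the talk treats the problem as PU learning rather than ordinary binary classification.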

Dimensionality reduction reveals structure in the data
t-distributed Stochastic Neighbour Embedding (t-SNE)

What are the most important features?
Chi-squared test + information gain
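As a reminder of what information gain measures, here is a minimal sketch for a single boolean feature. The slide does not say how the continuous scores were discretised, so reducing a feature to "evidence present or absent" is an assumption made only for this example.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        h -= p * log2(p)
    return h


def information_gain(feature_present, labels):
    """H(labels) - H(labels | feature) for a boolean feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(feature_present, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


labels = [1, 1, 0, 0]
informative = information_gain([True, True, False, False], labels)
uninformative = information_gain([True, False, True, False], labels)
```

A feature that separates the classes perfectly recovers the full label entropy, while one that is independent of the label gains nothing.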

Nested cross-validation and bagging for tuning and model selection
Tuning, training and testing four classifiers
Four classifiers are independently tuned, trained and tested on the training set using a nested cross-validation strategy (4 inner rounds for parameter tuning and 4 outer rounds to assess performance):
- Random forest (tuned parameters: number of trees and number of features)
- Feed-forward neural network with a single hidden layer (tuned parameters: size and decay)
- Support vector machine with radial kernel (tuned parameters: gamma and cost)
- Gradient boosting machine with AdaBoost exponential loss function (tuned parameters: number of trees and interaction depth)
In PU learning, U contains both positive and negative cases, which results in classifier instability. Bagging (bootstrap aggregating) can improve the performance of unstable classifiers by randomly resampling P and U with replacement (bootstrap) and then aggregating the results by majority voting.
Bagging with 100 iterations was applied to the neural network, the support vector machine and the gradient boosting machine. Random forests are already a special case of bagging.
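The bagging step described above can be illustrated with a toy example. A deliberately simple one-feature threshold learner stands in for the neural network, support vector machine and gradient boosting machine used in the talk; the function names and the data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Toy base learner: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2


def bagged_predict(pos_x, unl_x, query_x, n_iter=100, seed=0):
    """Bootstrap P and U separately, fit one learner per round, majority vote."""
    rng = random.Random(seed)
    votes = [0] * len(query_x)
    for _ in range(n_iter):
        boot_p = rng.choices(pos_x, k=len(pos_x))  # sampling with replacement
        boot_u = rng.choices(unl_x, k=len(unl_x))
        threshold = fit_stump(
            boot_p + boot_u, [1] * len(boot_p) + [0] * len(boot_u)
        )
        for i, x in enumerate(query_x):
            votes[i] += x >= threshold
    return [int(v > n_iter / 2) for v in votes]


# The unlabelled set hides one positive-looking case (0.85), as U does in PU learning.
positives_x = [0.6, 0.7, 0.8, 0.9]
unlabelled_x = [0.1, 0.2, 0.3, 0.85]
predictions = bagged_predict(positives_x, unlabelled_x, [0.05, 0.95])
```

Because each round sees a different resample of U, the hidden positive only contaminates some of the fitted learners, and the vote smooths out its effect.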

Evaluating classifier performance
Receiver operating characteristic curves
AUC 0.76
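For readers less familiar with the AUC, it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch of that definition, using made-up scores rather than the model's output:

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative scores only; three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.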

Disease association evidence higher for more advanced targets
Model predicts late-stage targets more easily than early-stage ones

Literature text mining validation of predictions
Highly significant overlap between predictions and text mining results

Conclusions
In silico predictions of novel therapeutic targets using gene disease association data
- The gene disease association data from Open Targets contains enough information to predict whether a protein can make a therapeutic target or not with reasonable accuracy (71% on the independent test set).
- Aside from standard cross-validation and testing, prediction results were also validated by mining the scientific literature for therapeutic targets and assessing the significance of the overlap.
- The ability of the neural network model to predict late-stage targets with greater accuracy confirms that clear linkage between target and disease is essential to maximise chances of success in the clinic.
- Of the evidence types tested, animal models showing disease-relevant phenotypes, dysregulated gene expression in disease tissue and genetic associations between gene and disease appear to be the most informative.

Acknowledgements
Ian Dunham, Philippe Sanseau, Gautier Koscielny, Giovanni Dall'Olio, Pankaj Agarwal, Mark Hurle, Steven Barrett, Nicola Richmond, Jin Yao

Thank you

Pharmaprojects
An industry-wide drug development database

Exploratory data analysis reveals sparse data with little structure
Hierarchical clustering + principal component analysis

Tune, train and test classifiers using cross-validation
Decision tree classification criteria

Evaluating classifier performance
Performance measures for supervised learning

Neural network performance on independent test set
Selected classifier with most balanced overall performance for further analyses

Measure                   Cross-validation   Test
Misclassification error   0.303              0.287
Accuracy                  0.697              0.713
AUC                       0.758              0.763
Recall/Sensitivity        0.610              0.638
Specificity               0.785              0.784
Precision                 0.742              0.736
F1 score                  0.670              0.683

Tune, train and test classifiers using cross-validation
Misclassification error
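The nested cross-validation scheme from the main slides (4 inner rounds for tuning, 4 outer rounds for assessing performance) can be sketched as below, reporting the misclassification error of each outer fold. A toy one-dimensional k-nearest-neighbour classifier and a small grid of k values stand in for the four classifiers and their tuned parameters; all names and data are hypothetical.

```python
import random


def k_folds(indices, k):
    """Split a list of indices into k roughly equal folds."""
    return [indices[i::k] for i in range(k)]


def knn_predict(train, x, k):
    """Toy 1-D k-nearest-neighbour classifier; train is a list of (x, y)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return int(sum(y for _, y in nearest) * 2 > len(nearest))


def error_rate(train, test, k):
    """Misclassification error of the k-NN classifier on `test`."""
    return sum(knn_predict(train, x, k) != y for x, y in test) / len(test)


def nested_cv(data, grid=(1, 3, 5), outer_k=4, inner_k=4, seed=0):
    """Return one error per outer fold; k is tuned on inner folds only."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    outer_errors = []
    for held_out in k_folds(idx, outer_k):
        dev = [i for i in idx if i not in held_out]

        def inner_error(k):
            errs = []
            for val in k_folds(dev, inner_k):
                fit = [data[i] for i in dev if i not in val]
                errs.append(error_rate(fit, [data[i] for i in val], k))
            return sum(errs) / len(errs)

        best_k = min(grid, key=inner_error)  # tuning never sees held_out
        outer_errors.append(
            error_rate([data[i] for i in dev], [data[i] for i in held_out], best_k)
        )
    return outer_errors


# Two tight, well-separated clusters, so every outer fold is classified correctly.
data = [(i * 0.01, 0) for i in range(16)] + [(0.85 + i * 0.01, 1) for i in range(16)]
errors = nested_cv(data)
```

The point of the nesting is that the held-out outer fold plays no part in choosing the hyperparameter, so the outer errors are not biased by tuning.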

Evaluate best classifier performance on test set
Confusion matrices (rows: actual value; columns: prediction outcome)

Cross-validation
                  Predicted Unknown   Predicted Target
Actual Unknown    912                 217
Actual Target     445                 700

Test
                  Predicted Unknown   Predicted Target
Actual Unknown    225                 67
Actual Target     99                  177
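The usual performance measures can be recomputed from a confusion matrix as a sanity check. The sketch below uses the test-set counts and assumes that rows are actual classes and columns are predictions, with "Target" as the positive class. Under that assumption the results land close to, but not exactly on, the figures reported for the test set, and the slides do not say how the reported figures were aggregated.

```python
def binary_metrics(tn, fp, fn, tp):
    """Standard measures derived from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_error": (fp + fn) / total,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Test-set counts, assuming rows = actual value and columns = prediction outcome.
test_metrics = binary_metrics(tn=225, fp=67, fn=99, tp=177)
for name, value in test_metrics.items():
    print(f"{name}: {value:.3f}")
```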

Split into training, test and prediction sets
Assess the effect of randomly sampling from unlabelled class: Monte Carlo simulation

Tune, train and test classifiers using cross-validation
Precision recall curves

Tune, train and test classifiers using cross-validation
Overlap between predictions on training set
[Figure panels: predicted targets; predicted non-targets]

Targets with lower disease association fail more often
Majority of targets with discontinued programmes not predicted as targets

Generating predictions on remaining 15K genes
Run model on prediction set (not used for training/testing)

Validate with literature text mining
Assess the significance of the literature-based validation: permutation test
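One way such a permutation test could be set up is sketched below: compare the observed overlap between predicted targets and text-mined targets with the overlap obtained when the same number of genes is drawn at random. The slides do not describe the exact permutation scheme, so the null model, the function name and the toy gene sets are all assumptions.

```python
import random


def overlap_permutation_test(predicted, text_mined, universe, n_perm=1000, seed=0):
    """One-sided empirical p-value for the overlap of two gene sets."""
    rng = random.Random(seed)
    pool = sorted(universe)
    hits = set(text_mined)
    observed = len(set(predicted) & hits)
    at_least = 0
    for _ in range(n_perm):
        # Null model (assumed): predictions are a random draw from the universe.
        random_set = rng.sample(pool, len(predicted))
        if len(hits.intersection(random_set)) >= observed:
            at_least += 1
    # The +1 correction keeps the estimate away from an impossible p = 0.
    return observed, (at_least + 1) / (n_perm + 1)


# Toy gene identifiers: 50 predictions, 40 of which are also text-mining hits.
universe = range(1000)
observed, p_value = overlap_permutation_test(range(50), range(40), universe)
```

With 1000 permutations the smallest reportable p-value is 1/1001, so more permutations are needed to resolve very small p-values.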