Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Reference
Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and genomics. Nature Reviews Genetics 16, 321-332 (2015). doi:10.1038/nrg3920

Topics
- Generative vs discriminative models
- Incorporating prior knowledge
- Handling heterogeneous data
- Feature selection
- Imbalanced class sizes
- Performance measures
- Handling missing data
- Modelling dependence among examples (genes)

Generative and discriminative models
Generative and discriminative models differ in their interpretability and in their prediction accuracy. The transcription factor binding problem serves as the running example below.

Generative vs discriminative models
The generative modelling approach offers several compelling benefits:
- The generative description of the data implies that the model parameters have well-defined semantics relative to the generative process. In the transcription factor binding problem, the model not only predicts the locations to which a given transcription factor binds but also explains why it binds there: when comparing two potential binding sites, we can see that the model prefers one site over the other, and also why, for example, a preference for an adenine rather than a thymine at position 7 of the motif.
- Generative models are frequently stated in terms of probabilities, and the probabilistic framework provides a principled way to handle problems such as missing data. For example, a PSFM can still make a prediction for a binding site in which one or more of the bound residues is unknown, by probabilistically averaging over the missing bases.
- The output of the probabilistic framework has well-defined probabilistic semantics, which is helpful when making downstream decisions about how much to trust a given prediction.
The primary benefit of the discriminative modelling approach is that it provably achieves better performance than the generative approach in the limit of infinite training data. In practice, analogous generative and discriminative approaches often converge to the same solution, and generative approaches can sometimes perform better with limited training data; when the amount of labelled training data is reasonably large, however, the discriminative approach will tend to find the better solution.
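To make the "averaging over missing bases" point concrete, here is a minimal sketch of PSFM scoring with marginalization over an unknown base. The motif length, the frequencies and the 'N' convention are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Hypothetical PSFM for a length-4 motif: rows are motif positions,
# columns are the frequencies of A, C, G, T at that position.
PSFM = np.array([
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.10, 0.80],
])
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def site_probability(site):
    """Probability of a candidate site under the PSFM; an unknown base
    ('N') is marginalized out by averaging over A, C, G and T."""
    p = 1.0
    for pos, base in enumerate(site):
        if base == "N":
            p *= PSFM[pos].mean()  # uniform average over the four bases
        else:
            p *= PSFM[pos, BASE_INDEX[base]]
    return p

print(site_probability("ACGT"))  # fully observed site
print(site_probability("ACNT"))  # one residue unknown
```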

Incorporating prior knowledge
A simple, principled method for putting a probabilistic prior on a position-specific frequency matrix is to augment the observed nucleotide counts with pseudocounts and then compute frequencies with respect to the sum. The magnitude of the pseudocount corresponds to the weight assigned to the prior.
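A short sketch of the pseudocount scheme; the counts and the pseudocount magnitude are made up for illustration.

```python
import numpy as np

# Observed nucleotide counts (A, C, G, T) at one motif position,
# e.g. tallied from ten aligned binding sites.
counts = np.array([7, 1, 0, 2])

pseudocount = 1.0  # the weight assigned to the (uniform) prior
freqs = (counts + pseudocount) / (counts.sum() + 4 * pseudocount)
print(freqs)  # no frequency is zero, so unseen bases remain possible
```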

Handling heterogeneous data
Diverse data types can be:
1) encoded into fixed-length features;
2) represented using pairwise similarities (that is, kernels); or
3) directly accommodated by a probability model (e.g., a Bayesian network).
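A minimal sketch of option 2, combining two data types through kernels; the data are random stand-ins and the equal kernel weights are an arbitrary choice.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 40
X_expr = rng.normal(size=(n, 5))    # stand-in for expression features
X_seq = rng.normal(size=(n, 100))   # stand-in for sequence features
y = rng.integers(0, 2, size=n)

# One kernel (similarity) matrix per data type, combined by a weighted
# sum; the combined matrix feeds any kernel method directly.
K = 0.5 * rbf_kernel(X_expr) + 0.5 * linear_kernel(X_seq)
clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))
```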

Feature selection
In practice, it is important to distinguish among three distinct motivations for carrying out feature selection:
1) We want to identify a very small set of features that yields the best possible classifier. For example, we may want an inexpensive way to identify a disease phenotype on the basis of the measured expression levels of a handful of genes; such a classifier, if accurate enough, might form the basis of an inexpensive clinical assay.
2) We may want to use the classifier to understand the underlying biology. In this case, we want the feature selection procedure to identify only the genes whose expression levels are actually relevant to the task at hand, in the hope that the corresponding functional annotations or biological pathways might provide insights into the aetiology of disease.
3) We often simply want to train the most accurate possible classifier. In this case, we hope that feature selection enables the classifier to identify and eliminate noisy or redundant features.
Researchers are often disappointed to find that feature selection cannot optimally perform more than one of these three tasks simultaneously. Feature selection is especially important in the third case because the analysis of high-dimensional data sets, including genomic, epigenomic, proteomic and metabolomic data sets, suffers from the curse of dimensionality: the general observation that many types of analysis become more difficult as the number of input dimensions (that is, data measurements) grows very large.

Feature selection approaches
Three main categories: wrappers, filters and embedded methods.

Wrappers
Evaluate candidate feature sets directly and select the subset that gives the best accuracy.
- Exhaustive search is an exponential problem: M features give 2^M subsets.
- Practical search strategies include sequential forward selection (which evaluates M(M+1)/2 feature sets instead of 2^M), recursive backward elimination and simulated annealing. A sketch of forward selection follows.
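As one concrete instance of the wrapper idea, scikit-learn's SequentialFeatureSelector performs greedy forward selection (direction="backward" gives backward elimination); the data here are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

# Wrapper: each candidate feature set is scored by the cross-validated
# accuracy of the wrapped classifier.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5, direction="forward")
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the selected features
```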

Filters
Replace the evaluation of trained models with quick-to-compute statistics. Examples of filtering criteria:
- mutual information with the target variable
- correlation with the target variable
- the chi-square statistic
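A filter sketch using one of the criteria above (mutual information) on the same kind of synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

# Filter: score each feature independently by its mutual information
# with the class label, then keep the k best; no model is trained.
selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print(selector.get_support(indices=True))
```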

Embedded methods
The classifier performs feature selection as part of the learning procedure. Example: the logistic LASSO.
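A logistic LASSO sketch: the L1 penalty drives most coefficients to exactly zero during fitting, so selection is embedded in training (the regularization strength C=0.1 is an arbitrary choice).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

# Embedded: the L1 penalty zeroes out uninformative coefficients
# while the model is being fit.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print((clf.coef_ != 0).sum(), "features retained out of", X.shape[1])
```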

Imbalanced class sizes: the enhancer prediction problem
A common stumbling block in many applications of machine learning to genomics is a large imbalance (or label skew) in the relative sizes of the groups being classified.
Enhancer prediction: starting with a set of 641 known enhancers, the genome can be broken up into 1,000-bp segments and each segment assigned a label ('enhancer' or 'not enhancer') on the basis of whether it overlaps a known enhancer. This procedure produces 1,711 positive examples and around 3,000,000 negative examples, roughly 2,000 times as many negatives as positives.
Suppose a classifier achieves an overall accuracy (the percentage of predictions that are correct) of 99.9%. The accuracy sounds good, but it is not an appropriate measure, because a null classifier that simply predicts everything to be a non-enhancer achieves nearly the same accuracy.
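The null-classifier point follows from a short calculation with the counts stated above:

```python
positives, negatives = 1_711, 3_000_000

# A null classifier that labels every segment "not enhancer" is correct
# on all negatives, so its overall accuracy is already about 99.9%.
null_accuracy = negatives / (positives + negatives)
print(f"{null_accuracy:.4%}")  # ~99.94%
```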

Enhancer prediction problem (cont.)
It is more appropriate to evaluate separately the sensitivity (the fraction of true enhancers detected) and the precision (the percentage of predicted enhancers that are truly enhancers).
- The classifier described above has a high precision (>99.9%) but a very low sensitivity of 0.5%.
- Its behaviour can be improved by using all of the enhancers for training and picking a random set of 49,000 non-enhancer positions as negative training examples. However, balancing the classes in this way results in the classifier learning to reproduce the artificially balanced ratio: the resulting classifier achieves much higher sensitivity (81%) but very poor precision (40%), so it is not useful for finding enhancers that can be validated experimentally (too many false positives).
- It is possible to trade off sensitivity against precision while retaining the training power of a balanced training set by placing weights on the training examples. For example, using the balanced training set and weighting each negative example 36 times more than each positive example during training resulted in a sensitivity of 53% at a precision of 95%. A sketch of this weighting scheme follows.
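A sketch of the balanced-training-set-plus-weights recipe on a synthetic imbalanced problem; the data, classifier and class-separation settings are all illustrative stand-ins, and only the 36x weight is taken from the slide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Imbalanced toy problem standing in for enhancer prediction.
X, y = make_classification(n_samples=50_000, n_features=20, weights=[0.99],
                           class_sep=1.5, random_state=0)

# Balanced training set: all positives plus an equal-sized random
# sample of negatives.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y == 1)
neg = rng.choice(np.flatnonzero(y == 0), size=pos.size, replace=False)
train = np.concatenate([pos, neg])

# Weight each negative example more heavily than each positive one to
# trade sensitivity for precision (the slide used a factor of 36).
w = np.where(y[train] == 0, 36.0, 1.0)
clf = LogisticRegression(max_iter=1000).fit(X[train], y[train],
                                            sample_weight=w)

pred = clf.predict(X)
print("sensitivity:", recall_score(y, pred).round(3))
print("precision:  ", precision_score(y, pred).round(3))
```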

Performance measures
- Precision = true positives / (true positives + false positives)
- Recall (or sensitivity) = true positives / (true positives + false negatives)
- F1 score: the harmonic mean of precision and recall, F1 = 2 * (precision * recall) / (precision + recall)
(Image credit: scikit-learn online documentation)
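The three measures on a tiny hand-made example, with labels chosen so the counts are easy to check by eye:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # TP=2, FN=1, FP=1, TN=4

print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.667
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) = 0.667
print(f1_score(y_true, y_pred))         # harmonic mean = 0.667
```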

Performance measures (cont.)
In general, the most appropriate performance measure depends on the intended application of the classifier. For problems such as identifying which tissue a given cell comes from, it may be equally important to identify rare and abundant tissues, so the overall number of correct predictions may be the most informative measure of performance. In other problems, such as enhancer detection, predictions in one class may be more important than predictions in another: if positive predictions will be published (and/or experimentally validated), the most appropriate measure may be the sensitivity among a set of predictions at a predetermined precision (for example, 95%). A wide variety of performance measures are used in practice, including the F1 measure, the receiver operating characteristic (ROC) curve and the precision-recall curve, among others. Machine learning classifiers perform best when they are optimized for a realistic performance measure.
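A sketch of "sensitivity at a predetermined precision" read off a precision-recall curve; the scores and the 80% target are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.55, 0.6, 0.65, 0.8, 0.85, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# Best sensitivity (recall) among thresholds meeting the precision target;
# precision[i] and recall[i] pair with thresholds[i].
target = 0.80
ok = precision[:-1] >= target
print(recall[:-1][ok].max() if ok.any() else "target precision unreachable")
```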

Handling missing data
Missing data come in two flavours:
- values missing at random, or for reasons unrelated to the task at hand;
- values whose absence itself provides information about the task at hand (e.g., patients who become too sick to be measured).
Different ways to deal with missing data:
- Impute the missing values: replace all missing values with zero, or use a more sophisticated strategy (e.g., Troyanskaya et al. used the correlations between data values to impute missing microarray values).
- Include in the model information about the 'missingness' of each data point. For example, Kircher et al. aimed to predict the deleteriousness of mutations from functional genomic data; for each feature, they added a Boolean feature indicating whether the corresponding value was present, and the missing values themselves were then replaced with zeroes. An advantage of this approach is that it is applicable regardless of whether the absence of a data point is significant: if it is not, the model will learn to ignore the absence indicator. (A sketch follows.)
- Use probability models to handle missing data explicitly by considering all potential missing values: a missing data point is handled by summing over all possibilities for that random variable in the model. This approach, called marginalization, represents the case in which a particular variable is unobserved. Marginalization is only appropriate when data points are missing for reasons unrelated to the task at hand; when the presence or absence of a data point is likely to be correlated with the values themselves, incorporating presence or absence explicitly into the model is more appropriate.
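The impute-plus-indicator strategy can be sketched with scikit-learn's SimpleImputer; the tiny matrix below is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 0.5],
              [np.nan, 0.7]])

# Replace missing values with zero and append one Boolean indicator
# column per feature that had missing values, in the spirit of the
# Kircher et al. strategy described above.
imputer = SimpleImputer(strategy="constant", fill_value=0.0,
                        add_indicator=True)
print(imputer.fit_transform(X))
```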

Modelling dependence among examples
- Methods that infer each relationship in a network separately, such as by computing the correlation between each pair of genes, can be confounded by indirect relationships.
- Methods that infer the network as a whole can identify only direct relationships.
- Inferring the direction of causality inherent in networks is generally more challenging than inferring the network structure.
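The confounding-by-indirect-relationships point can be seen on a simulated three-gene chain; GraphicalLassoCV is used here as one possible stand-in for a method that infers the network as a whole, not as the approach the slides have in mind.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
# Chain g0 -> g1 -> g2: g0 and g2 are related only through g1.
g0 = rng.normal(size=500)
g1 = g0 + 0.5 * rng.normal(size=500)
g2 = g1 + 0.5 * rng.normal(size=500)
X = np.column_stack([g0, g1, g2])

# Pairwise correlations flag a strong (but indirect) g0-g2 relationship...
print(np.corrcoef(X.T).round(2))
# ...whereas the jointly estimated sparse precision matrix shrinks the
# direct g0-g2 term toward zero.
print(GraphicalLassoCV().fit(X).precision_.round(2))
```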

Moving forward
On the one hand, machine learning methods, which are most effective in the analysis of large, complex data sets, are likely to become ever more important to genomics as more large data sets become available through international collaborative projects such as the 1000 Genomes Project, the 100,000 Genomes Project, ENCODE, the Roadmap Epigenomics Project and the US National Institutes of Health's 4D Nucleome Initiative. On the other hand, even in the presence of massive amounts of data, machine learning techniques are not generally useful when applied in an arbitrary manner. In practice, achieving good performance from a machine learning method usually requires theoretical and practical knowledge of both machine learning methodology and the particular research application area. As new technologies for generating large genomic and proteomic data sets emerge, pushing beyond DNA sequencing to mass spectrometry, flow cytometry and high-resolution imaging, demand will increase not only for new machine learning methods but also for experts capable of applying and adapting them to big data sets. In this sense, both machine learning itself and scientists proficient in these applications are likely to become increasingly important to advancing genetics and genomics.
Nature Reviews Genetics 16, 321-332 (2015). doi:10.1038/nrg3920