Statistical Machine Learning Methods for Bioinformatics VI. Support Vector Machine Applications in Bioinformatics


Statistical Machine Learning Methods for Bioinformatics
VI. Support Vector Machine Applications in Bioinformatics
Jianlin Cheng, PhD
Computer Science Department and Informatics Institute, University of Missouri, 2016
Free for Academic Use. Copyright © Jianlin Cheng & original sources of some materials.

SVM Applications in Bioinformatics
Cancer Classification using Gene Expression Data
Protein Mutation Stability Prediction
Protein Secondary Structure Prediction
Protein Fold Recognition
Protein Contact Map Prediction
Protein Structure Classification

Project 4
Classify cancer using gene expression data by SVM.
SVM tools: SVM-light (C) or Weka (Java); R and MATLAB.
Reference: Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, 1999.
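As a rough illustration of the project task, the sketch below trains a linear soft-margin SVM with a Pegasos-style sub-gradient update in plain Python. The toy expression values, the function names, and the settings lam and epochs are assumptions made for illustration only. The project itself points to SVM-light, Weka, R, or MATLAB, and the bias term is omitted here on the assumption that expression values have been centered.

```python
# Minimal linear SVM sketch (Pegasos-style sub-gradient descent).
# Toy data and hyperparameters are hypothetical, not taken from Golub et al.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, epochs=50):
    """Approximately minimize lam/2 * |w|^2 + mean hinge loss (no bias term)."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)            # decaying step size
            violated = yi * dot(w, xi) < 1   # sample on the wrong side of the margin?
            shrink = 1.0 - 1.0 / t           # equals 1 - eta * lam
            w = [shrink * wj for wj in w]
            if violated:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1

# Rows are samples, columns are two centered "gene expression" values.
X = [[2.0, -1.0], [3.0, -0.5], [2.5, -1.5],
     [-1.0, 2.0], [-0.5, 3.0], [-1.5, 2.5]]
y = [1, 1, 1, -1, -1, -1]   # e.g. +1 for one tumor class, -1 for the other

w = train_linear_svm(X, y)
predictions = [predict(w, x) for x in X]
```

On real data with thousands of genes and only 72 samples, a held-out test set or cross-validation would be needed to estimate accuracy; training accuracy alone says little.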

Current Cancer Diagnosis
A reliable and precise classification of tumors is essential for successful treatment of cancer.
Current methods rely on the subjective interpretation of both clinical and histopathological information, with an eye toward placing tumors in currently accepted categories based on the tissue of origin of the tumor.
However, clinical information can be misleading or incomplete.
There is a wide spectrum in cancer morphology, and many tumors are atypical or lack morphologic features, which results in diagnostic confusion.
Jia Yi, 2005

Typical DNA Microarray Experiment

DNA Microarray-based Cancer Diagnosis
Molecular diagnostics offer the promise of precise, objective, and systematic cancer classification.
Recently, DNA microarray tumor gene expression profiles have been used for cancer diagnosis.
By allowing the monitoring of expression levels for thousands of genes simultaneously, such techniques will lead to a more complete understanding of the molecular variations among tumors, and hence to a finer and more reliable classification.

Tumor Classification Types
There are three main types of statistical problems associated with tumor classification:
1. The identification of new tumor classes using gene expression profiles (unsupervised learning).
2. The classification of malignancies into known classes (supervised learning).
3. The identification of marker genes that characterize the different tumor classes (variable selection).
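The third problem, marker gene identification, can be illustrated with the signal-to-noise statistic used by Golub et al. (1999), (mean_a - mean_b) / (sd_a + sd_b). The slide does not say which selection method is intended, so this is only one plausible sketch; the gene names and expression values below are invented for the example.

```python
# Rank genes by the Golub et al. signal-to-noise statistic.
# Gene names and values are hypothetical toy data.
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """(mean_a - mean_b) / (sd_a + sd_b); assumes the two sds are not both zero."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

def rank_marker_genes(expr, labels, class_a, class_b):
    """Return gene names sorted by absolute signal-to-noise, plus the scores."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        scores[gene] = signal_to_noise(a, b)
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranked, scores

labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expr = {
    "geneA": [5.0, 5.2, 4.8, 1.0, 1.2, 0.8],   # high in ALL
    "geneB": [3.0, 1.0, 2.0, 2.0, 3.0, 1.0],   # uninformative
    "geneC": [1.0, 1.5, 0.5, 4.0, 4.5, 3.5],   # high in AML
}
ranked, scores = rank_marker_genes(expr, labels, "ALL", "AML")
```

Genes with large positive scores are candidate markers for the first class, and genes with large negative scores for the second.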

Source of Datasets (cont.)
Leukemia dataset
This dataset contains gene expression in two types of acute leukemia: ALL and AML.
The study produced gene expression data for p = 6,817 genes in n = 72 mRNA samples:
47 ALL (38 B-cell ALL, 9 T-cell ALL)
25 AML
http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi

The Universe of Protein Structures
Christine Orengo, 1997, Structure 5: 1093-1108
B. Rost, 2005

Typical Folds
Fold: connectivity or arrangement of secondary structure elements.
NAD-binding Rossmann fold: 3 layers, a/b/a; parallel beta-sheet of 6 strands. Order: 321456
http://scop.berkeley.edu/rsgen.cgi?chime=1;pd=1b16;pc=a

Fold: TIM Beta-Alpha Barrel
Contains parallel beta-sheet barrel, closed. 8 strands. Order: 12345678
http://scop.berkeley.edu/rsgen.cgi?chime=1;pd=1hti;pc=a

Helix Bundle (Human Growth Hormone)
http://scop.berkeley.edu/rsgen.cgi?chime=1;pd=1hgu

Fold: Beta Barrel http://scop.berkeley.edu/rsgen.cgi?chime=1;pd=1rbp

Fold: Lambda Repressor DNA Binding
http://scop.berkeley.edu/rsgen.cgi?chime=1;pd=5cro;pc=0

Number of Unique Folds (defined by SCOP) in PDB
[Figure: number of unique folds (structures, axis ticks at 500 and 1,000) by year, 1980 to 2006. J. Pevsner, 2005]

3D Structure Prediction
Ab initio structure prediction: query protein (MWLKKFGINLLIGQSV) -> physical force field and protein folding simulation, or contact map reconstruction -> select structure with minimum free energy.
Template-based structure prediction: query protein (MWLKKFGINKH) -> fold recognition against the Protein Data Bank -> template -> alignment.

Template-Based Structure Prediction
1. Template Identification
2. Query-Template Alignment
3. Model Generation (Modeller, Sali and Blundell, 1993)
4. Model Evaluation
5. Model Refinement

Classic Fold Recognition Approaches
Sequence-Sequence Alignment (Needleman and Wunsch, 1970; Smith and Waterman, 1981)
Query:    ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
Template: ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL
Alignment (similarity) score
Works for >40% sequence identity (close homologs in a protein family).
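A global alignment score of the Needleman-Wunsch kind can be computed with a short dynamic program. The sketch below is a minimal version that returns only the score; the match, mismatch, and gap values are placeholder assumptions, whereas practical tools use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
# Needleman-Wunsch global alignment score (score only, linear gap penalty).
# The scoring values are illustrative assumptions.

def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                            prev[j] + gap,     # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]

query = "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL"
self_score = needleman_wunsch_score(query, query)
```

Keeping only two rows of the table makes memory linear in the template length; recovering the alignment itself would require storing the full table or back-pointers.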

Classic Fold Recognition Approaches
Profile-Sequence Alignment (Altschul et al., 1997)
Query family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Average score
Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
More sensitive for distant homologs in a superfamily (>25% identity).

Classic Fold Recognition Approaches
Profile-Sequence Alignment (Altschul et al., 1997)
Query family (positions 1, 2, ..., n):
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
Position-specific scoring matrix or hidden Markov model over positions 1, 2, ..., n (example column: A 0.4, C 0.1, W 0.5)
More sensitive for distant homologs in a superfamily (>25% identity).
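The per-position residue frequencies behind a profile can be tabulated directly from the aligned family. The sketch below does this for the four query-family sequences shown on the slide; it is a simplification, since real position-specific scoring matrices (for example those built by PSI-BLAST) also apply sequence weighting, pseudocounts, and log-odds scaling against background frequencies.

```python
# Build a simple per-column frequency profile from an ungapped alignment.
from collections import Counter

def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        profile.append({aa: n / len(alignment) for aa, n in counts.items()})
    return profile

family = [
    "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL",
    "ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL",
]
profile = column_frequencies(family)
```

Fully conserved columns, such as the first, contain a single residue with frequency 1.0, while variable columns spread the mass over several residues.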

Classic Fold Recognition Approaches
Profile-Profile Alignment (Rychlewski et al., 2000)
Query family (positions 1, 2, ..., n; example column: A 0.1, C 0.4, W 0.5):
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Template family (positions 1, 2, ..., m; example column: A 0.3, C 0.5, W 0.2):
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN
IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM
More sensitive for very distant homologs (>15% identity).

Classic Fold Recognition Approaches
Sequence-Structure Alignment (Threading) (Bowie et al., 1991; Jones et al., 1992; Godzik and Skolnick, 1992; Lathrop, 1994)
Query (MWLKKFGINLLIGQS...) -> fit to template structure -> fitness score
Useful for recognizing similar folds without sequence similarity (no evolutionary relationship).

Integration of Complementary Approaches
Meta server consensus (Lundstrom et al., 2001; Fischer, 2003)
Query -> meta server -> FR server 1, FR server 2, FR server 3 (over the Internet)
1. Reliability depends on the availability of external servers
2. Makes decisions on a handful of candidates

Machine Learning Classification Approach
Proteins -> Support Vector Machine (SVM) -> Class 1, Class 2, ..., Class m
Classify individual proteins into several or dozens of structure classes (Ding and Dubchak, 2001; Jaakkola et al., 2000; Leslie et al., 2002; Saigo et al., 2004; Rangwala and Karypis, 2005)
Problem 1: cannot scale up to thousands of protein classes
Problem 2: does not provide templates for structure modeling

Machine Learning Information Retrieval Framework
Query-template pair -> extract pairwise features -> relevance function (e.g., SVM) -> score (+ or -)
Score 1, Score 2, ..., Score n -> rank
Comparison of two pairs (four proteins)
Relevant or not (one score) vs. many classes
Ranking of templates (retrieval)
Cheng and Baldi, Bioinformatics, 2006

Pairwise Feature Extraction
Sequence/Family Information Features: cosine, correlation, and Gaussian kernel
Sequence-Sequence Alignment Features: Palign, ClustalW
Sequence-Profile Alignment Features: PSI-BLAST, IMPALA, HMMer, RPS-BLAST
Profile-Profile Alignment Features: ClustalW, HHSearch, Lobster, Compass, PRC-HMM
Structural Features: secondary structure, solvent accessibility, contact map, beta-sheet topology
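The first feature group (cosine, correlation, and Gaussian kernel) can be sketched as similarity measures between two per-protein vectors. The slide does not state which vectors are compared, so the example below assumes 20-dimensional amino acid composition vectors; the gamma value and the function names are likewise illustrative assumptions.

```python
# Three pairwise similarity features between a query and a template,
# computed here on amino acid composition vectors (an assumption).
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def correlation(u, v):
    # Pearson correlation is the cosine of the mean-centered vectors.
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

def gaussian_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_features(query, template):
    q, t = composition(query), composition(template)
    return [cosine(q, t), correlation(q, t), gaussian_kernel(q, t)]

query = "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL"
template = "ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN"
same = pairwise_features(query, query)
diff = pairwise_features(query, template)
```

Each query-template pair then contributes a fixed-length feature vector, regardless of how long the two proteins are, which is what lets a single SVM score every pair.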

Top Ranked Features

Relevance Function: Support Vector Machine Learning
Training data set in feature space: positive pairs (same folds) and negative pairs (different folds)
Support vector machine training/learning -> hyperplane

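Relevance Function: Support Vector Machine Learning
[Figure: panels (1) and (2), each showing a separating hyperplane and its margin]
Decision function f(x), where K is the Gaussian kernel. [Equations not recovered from the slide images.]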
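For reference, the decision function of a kernel SVM and the Gaussian kernel are commonly written as below. This is the standard textbook form rather than a transcription of the slide; the symbols \alpha_i, y_i, b, and \gamma are notation assumed here.

```latex
f(x) = \sum_{i=1}^{N} \alpha_i \, y_i \, K(x, x_i) + b,
\qquad
K(x, x') = \exp\!\left(-\gamma \, \lVert x - x' \rVert^{2}\right)
```

Here the x_i are training feature vectors (query-template pairs) with labels y_i in {+1, -1}, the coefficients \alpha_i are nonzero only for support vectors, and the value f(x) can be used as the relevance score for a new pair x.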

Training and Cross-Validation
Standard benchmark (Lindahl's dataset, 976 proteins)
976 x 975 query-template pairs (about 7,468 positives)
Queries 1, 2, 3, ..., 976, each with 975 pairs
Train / learn: pairs of queries 1-878 (90%)
Test: pairs of queries 879-976 (10%)
Rank the 975 templates for each query
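The split described above is made by query, so all 975 pairs of a given query fall on the same side. A small sketch of the bookkeeping, with the helper name query_template_pairs being hypothetical:

```python
# Enumerate query-template pairs and split them by query, as on the slide:
# queries 1-878 for training, queries 879-976 for testing.

def query_template_pairs(queries, n_proteins):
    """Yield (query, template) index pairs, excluding self-pairs."""
    for q in queries:
        for t in range(1, n_proteins + 1):
            if t != q:
                yield (q, t)

N = 976
train_queries = range(1, 879)      # queries 1..878
test_queries = range(879, N + 1)   # queries 879..976

n_train_pairs = sum(1 for _ in query_template_pairs(train_queries, N))
n_test_pairs = sum(1 for _ in query_template_pairs(test_queries, N))
```

Splitting by query rather than by pair keeps every test query's full template list intact, so the 975 templates can be ranked for that query at test time.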

Results for Top Five Ranked Templates
Method      Family  Superfamily  Fold
PSI-BLAST   72.3    27.9         4.7
HMMER       73.5    31.3         14.6
SAM-T98     75.4    38.9         18.7
BLASTLINK   78.9    40.6         16.5
SSEARCH     75.5    32.5         15.6
SSHMM       71.7    31.6         24.0
THREADER    58.9    24.7         37.7
FUGUE       85.8    53.2         26.8
RAPTOR      77.8    50.0         45.1
SPARKS3     86.8    67.7         47.4
FOLDpro     89.9    70.0         48.3
Family: close homologs, more identity
Superfamily: distant homologs, less identity
Fold: no evolutionary relation, no identity
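A top-five figure of this kind is typically the percentage of queries for which at least one correct template appears among the five highest-scoring templates. The sketch below assumes that definition, which the slide does not spell out, and uses invented scores and template names.

```python
# Rank templates by relevance score and compute a top-k hit percentage.
# The scoring data are hypothetical.

def rank_templates(scores):
    """Sort template ids by descending relevance score."""
    return sorted(scores, key=scores.get, reverse=True)

def top_k_sensitivity(all_scores, correct, k=5):
    """Percent of queries with at least one correct template in the top k."""
    hits = 0
    for query, scores in all_scores.items():
        top = set(rank_templates(scores)[:k])
        if top & correct.get(query, set()):
            hits += 1
    return 100.0 * hits / len(all_scores)

all_scores = {
    "q1": {"t1": 0.9, "t2": 0.1, "t3": 0.5},
    "q2": {"t1": 0.2, "t2": 0.8, "t3": 0.7},
}
correct = {"q1": {"t3"}, "q2": {"t1"}}

top2 = top_k_sensitivity(all_scores, correct, k=2)
top3 = top_k_sensitivity(all_scores, correct, k=3)
```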

Advantages of MLIR Framework
Integration, accuracy, extensibility
Simplicity, completeness, potentials
Disadvantages
Slower than some alignment methods
Challenge: analogous fold recognition using machine learning ranking techniques

A. Fisher, 2005

2D: Contact Map Prediction
3D structure -> 2D contact map (residues 1, 2, 3, ..., i, ..., n by residues 1, 2, ..., j, ..., n)
Methods: 2D-recursive neural network, support vector machine
Distance threshold = 8 Å
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005.
Cheng and Baldi. BMC Bioinformatics, 2007.

Definition of Contact Prediction
Predict whether any two residues i, j are in contact according to a distance threshold (8 Angstrom).
Interested in short- to long-range contacts (|i-j| >= 6).
Use a window (size = 9) to encode the information about residues i and j, respectively.
Train on a training dataset and test on a test dataset.
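The definition above translates into a few lines of code once coordinates are available. The sketch below assumes one representative atom per residue (for example C-alpha, which the slide does not specify) and a strict "less than 8 Angstrom" test; the hairpin-shaped toy coordinates are invented.

```python
# Derive the list of true contacts from 3D coordinates:
# distance below the threshold and sequence separation |i - j| >= 6.
import math

def contact_pairs(coords, threshold=8.0, min_separation=6):
    contacts = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts.append((i, j))
    return contacts

# Toy hairpin: residues 0-3 run along x, residues 4-7 run back at y = 5.
coords = [(3.8 * i, 0.0, 0.0) for i in range(4)] + \
         [(3.8 * (7 - i), 5.0, 0.0) for i in range(4, 8)]

contacts = contact_pairs(coords)
```

Residue indices here are zero-based, and math.dist requires Python 3.8 or later.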

Feature Extraction
Local window features (20 * 9 * 2)
Pairwise information features (cosine, correlation, mutual information)
Residue type features (non-polar, polar, acidic, and basic)
Central segment window features
Protein information features (global composition, sequence length)
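The local window feature count, 20 * 9 * 2 = 360, corresponds to 20 values per position over a 9-residue window around each of the two residues. The sketch below uses a one-hot residue encoding as a stand-in; the 20 values per position on the slide may well be profile frequencies instead, and the zero-padding at the sequence ends is an assumption.

```python
# Encode 9-residue windows around residues i and j as a 360-dimensional vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_window(seq, center, size=9):
    """20 values per window position; positions off the sequence stay all-zero."""
    half = size // 2
    features = []
    for pos in range(center - half, center + half + 1):
        column = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(seq):
            column[AMINO_ACIDS.index(seq[pos])] = 1
        features.extend(column)
    return features

def local_window_features(seq, i, j, size=9):
    return one_hot_window(seq, i, size) + one_hot_window(seq, j, size)

seq = "MWLKKFGINLLIGQSV"
features = local_window_features(seq, 2, 12)
```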

Kernel and Feature Selection
The Gaussian kernel seems to work well. However, we haven't tested other kernels thoroughly.
Feature selection should be able to improve the performance. However, we haven't conducted a thorough feature selection yet due to limited computing power.

Results
At the break-even point, sensitivity = specificity = 28%.
However, the accuracy varies significantly with the properties of the individual proteins.
Contacts within beta-sheets are predicted with higher accuracy than those in alpha helices or between an alpha helix and a beta-sheet.
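One common way to reach a break-even point is to predict exactly as many contacts as there are true contacts, taking the highest-scoring pairs first. The sketch below assumes that procedure and assumes that "specificity" is used in the contact-prediction sense of TP / (number predicted), neither of which the slide states explicitly. The scores and pairs are invented.

```python
# Break-even evaluation: select the top-N scored pairs, N = number of true contacts.

def break_even(scores, true_contacts):
    n_true = len(true_contacts)
    ranked = sorted(scores, key=scores.get, reverse=True)
    predicted = set(ranked[:n_true])
    tp = len(predicted & true_contacts)
    sensitivity = tp / n_true            # fraction of true contacts recovered
    specificity = tp / len(predicted)    # fraction of predictions that are correct
    return sensitivity, specificity

scores = {(0, 6): 0.9, (0, 7): 0.8, (2, 8): 0.7, (1, 7): 0.3, (3, 9): 0.1}
true_contacts = {(0, 6), (1, 7), (2, 8)}

sensitivity, specificity = break_even(scores, true_contacts)
```

Because the number of predictions equals the number of true contacts, the two ratios share the same numerator and denominator and therefore coincide.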

Results on 48 test proteins

One Example

Predicted VS True Contacts

How to Use Contacts to Reconstruct 3D Structure
The 3D structure prediction problem can be defined as a constrained optimization problem.
Generate a 3D structure with minimum free energy subject to contact restraints and intrinsic biophysical constraints such as bond length.
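One simple way to express such restraints is as a penalty that an optimizer would drive toward zero. The sketch below is only a restraint-violation score, not a free energy function: it assumes one point per residue, a 3.8 Angstrom spacing between consecutive residues (the usual C-alpha to C-alpha distance), and quadratic penalties, all of which are illustrative choices rather than the method of any tool named in these slides.

```python
# Quadratic penalty for violating chain-spacing and contact restraints.
import math

def restraint_penalty(coords, contacts, bond_length=3.8, threshold=8.0):
    penalty = 0.0
    # Consecutive residues should sit bond_length apart.
    for a, b in zip(coords, coords[1:]):
        penalty += (math.dist(a, b) - bond_length) ** 2
    # Predicted contacts should be within the distance threshold.
    for i, j in contacts:
        excess = math.dist(coords[i], coords[j]) - threshold
        if excess > 0:
            penalty += excess ** 2
    return penalty

contacts = [(0, 6), (0, 7), (1, 7)]
hairpin = [(3.8 * i, 0.0, 0.0) for i in range(4)] + \
          [(3.8 * (7 - i), 5.0, 0.0) for i in range(4, 8)]
extended = [(3.8 * i, 0.0, 0.0) for i in range(8)]

hairpin_penalty = restraint_penalty(hairpin, contacts)
extended_penalty = restraint_penalty(extended, contacts)
```

The folded hairpin satisfies all three contacts and pays only for one stretched link, while the extended chain violates every contact, so a minimizer would prefer the hairpin.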

Optimization Techniques
CONFOLD distance geometry
Gradient Descent (Modeller)
Lattice Monte Carlo Sampling (TASSER)
Simulated Annealing (Rosetta)
Multi-Dimensional Scaling

Contact Prediction Software
http://sysbio.rnet.missouri.edu/multicom_toolbox/tools.html (svmcon 1.0, source code and executable)
Reference: J. Cheng and P. Baldi. Improved residue contact prediction using support vector machine and a large feature set. BMC Bioinformatics, 2007.