Predicting prokaryotic incubation times from genomic features

Maeva Fincker - mfincker@stanford.edu

Final report

Introduction

We have barely scratched the surface when it comes to microbial diversity. For every microorganism we can culture and study in the lab, at least another 50 exist in the environment that are labelled uncultivable. Although most of these uncultivable strains resist growing under laboratory conditions, recent advances in DNA sequencing have given us access to their genomes. One of the main questions that bioinformatics in the field of microbiology is trying to answer is the following: how can we infer function from genomic information without access to experimental validation? [1]

This project focuses on the prediction of microbial growth rate (how fast a microorganism can divide) from genomic information. More specifically, counts of specific genomic features encoding microbial functions are extracted from annotated microbial genomes for which the incubation time is known. These features are then used as input variables for three different classifiers (softmax classification, SVM and random forest) whose role is to find the most probable label among 6 classes representing different incubation times.

Previous work

A literature review did not reveal many examples of machine learning tools being used to predict the physiological characteristics of a microorganism from genomic features. Lauro et al. used self-organizing maps to cluster microorganisms based on genomic features and related the clusters to types of metabolism. [2] Roller et al. showed that the number of 16S rRNA operons correlates with growth rate and used it in a supervised regression to predict bacterial reproductive strategy. Their results, suggesting that growth rate may be dictated by metabolic functions, are the basis for this project. [3]

Datasets & Feature selection

The growth rate of a microorganism describes how long it takes for a cell to divide. When a new organism is isolated, microbiologists report growth rates in various ways, which makes direct comparisons difficult, if they even measure such a physiological trait at all. Since mining the scientific literature for growth rate values could be a machine learning project of its own that I did not wish to tackle, I decided to use incubation time as a proxy for growth rate. Incubation times are reported by strain collections and represent approximately how long one has to wait to observe full growth of a microbe.

The BacDive database [4] contains the incubation times of 1232 microorganisms in 6 different time bins: 1-2 days, 2-3 days, 3-7 days, 8-14 days, 10-14 days and >14 days. Out of these 1232 organisms, 783 have a complete genome (or ordered scaffold genome assembly) available on NCBI [5], but only 596 of these genomes have been properly annotated. Therefore, my dataset is composed of 596 genomes and their associated incubation times.

[Fig. 1: histogram of the labels in the dataset - counts per bin: 1-2 days: 224, 2-3 days: 102, 3-7 days: 81, 8-14 days: 27, 10-14 days: 144, >14 days: 18]

Figure 1 shows the histogram of the labels of my 596 examples. The classes are very unbalanced, with an overrepresentation of the 1-2 day and 10-14 day classes.

Given the hypothesis that incubation time is dictated by metabolic capabilities and functions, I focused on extracting metabolically relevant features from the genomes: counts of proteins belonging to each Pfam family, the length of the genome and the number of 16S rRNA operons. The Pfam families are clusters of homologous proteins sharing a similar function. Extracting the Pfam counts yielded a total of 468 features, which was too large a number of features when I only had access to 596 genomes. Keeping only Pfam features that appear in at least 3 genomes lowered the total number of features to 7549.

Proteins from more than one Pfam cluster can be responsible for the same metabolic function, making my dataset highly redundant. To further decrease the number of features, I used a feature selection algorithm called the Fast Correlation-Based Filter (FCBF) [6]. This algorithm uses the symmetrical uncertainty (SU) of two variables as a measure of their correlation. Given two variables X and Y, SU is defined as two times the ratio of the information gain of X given Y over the sum of their entropies:

SU(X, Y) = \frac{2 \, IG(X \mid Y)}{H(X) + H(Y)}

The algorithm first computes the SU between each variable and the label over the dataset. Keeping only the features whose SU with respect to the label is above a threshold d, the algorithm then considers pairs of features and calculates their SU with respect to each other. If the SU between two features is higher than the SU between either of them and the label (meaning they are more correlated to each other than to the label), the feature with the lower SU with respect to the label is removed from the dataset. The algorithm loops over pairs of features until all remaining features have a higher SU with respect to the label than to each other. Using a threshold d of 0 (which means keeping all non-redundant features that correlate with the label), the Fast Correlation-Based Filter selected 20 features.

Classifier building and selection were therefore carried out on two datasets of 596 examples each: the full dataset contained 7549 features and the filtered FCBF dataset contained 20 features. Both datasets were scaled (s.d. 1), centered (mean 0) and Box-Cox transformed to correct for skewness before downstream applications. Each dataset was divided into a training set (537 examples, 90%) and a test set (59 examples, 10%).
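For concreteness, below is a minimal R sketch of the symmetrical-uncertainty measure at the core of FCBF. It assumes features have been discretized first, and all function and variable names are illustrative rather than taken from the report's code.

```r
# Symmetrical uncertainty SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)),
# computed on discretized feature vectors. Illustrative sketch only.
entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

cond_entropy <- function(x, y) {
  # H(X | Y): entropy of x within each level of y, weighted by P(Y = y)
  y <- as.factor(y)
  h <- tapply(x, y, entropy)
  w <- as.numeric(table(y)) / length(y)
  sum(w * h)
}

symmetrical_uncertainty <- function(x, y) {
  ig <- entropy(x) - cond_entropy(x, y)   # information gain IG(X | Y)
  2 * ig / (entropy(x) + entropy(y))
}

# FCBF's first pass would keep the features with SU(feature, label) > d:
# su   <- apply(features, 2, symmetrical_uncertainty, y = labels)
# kept <- which(su > d)
```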

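The centering, scaling and Box-Cox steps could plausibly be reproduced with the caret package, although the report does not name its preprocessing tooling; object names below are placeholders. Note that Box-Cox requires strictly positive values, so zero counts would need an offset (or a Yeo-Johnson transform instead).

```r
# Hypothetical preprocessing pipeline with caret (assumed tooling, not the report's code).
library(caret)

pp <- preProcess(train_features, method = c("BoxCox", "center", "scale"))
train_scaled <- predict(pp, train_features)
test_scaled  <- predict(pp, test_features)  # reuse the training-set transform on the test set
```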
Methods

I built and compared 3 different classifiers: softmax classification with L1 regularization, one vs. one SVM with an RBF kernel, and random forest.

Softmax classification: In softmax regression, the hypothesis (for a problem with K classes) takes the following form:

h_\theta(x)_k = \frac{\exp(\theta_k^\top x)}{\sum_{j=1}^{K} \exp(\theta_j^\top x)}

And the log-likelihood function to maximize is:

\ell(\theta) = \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log h_\theta(x^{(i)})_k

There is no closed form for the maximum likelihood estimates, so I used batch gradient descent to find them. Since I had a large number of features, I also tried adding an L1 regularization term to the likelihood function, such that the function to maximize becomes:

\ell_{reg}(\theta) = \ell(\theta) - C \, \|\theta\|_1

The L1 regularization term brings the weights of unimportant variables down to zero, thereby effectively selecting the features of importance. The model was fitted to both the full and the filtered training sets, and the best value of C for each dataset was selected via 10-fold cross-validation on the training set. Finally, the models were evaluated on the test set.
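A compact sketch of what such a fit could look like in R is given below (batch gradient ascent on the mean log-likelihood with an L1 subgradient penalty); X, y and all names are assumptions, not the report's implementation.

```r
# Softmax regression by gradient ascent with an L1 penalty (illustrative sketch).
# X: n x p feature matrix; y: integer labels in 1..K.
softmax_fit <- function(X, y, K, C = 0.1, lr = 0.05, iters = 100) {
  theta <- matrix(0, nrow = ncol(X), ncol = K)
  Y <- diag(K)[y, ]                                 # one-hot labels, n x K
  for (t in seq_len(iters)) {
    scores <- X %*% theta
    scores <- scores - apply(scores, 1, max)        # stabilize the exponentials
    P <- exp(scores) / rowSums(exp(scores))         # softmax probabilities
    grad <- t(X) %*% (Y - P) / nrow(X)              # gradient of the mean log-likelihood
    theta <- theta + lr * (grad - C * sign(theta))  # ascent step with L1 subgradient
  }
  theta
}

# Predicted class: the column with the highest score.
# pred <- max.col(X_test %*% theta)
```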

SVM classification: SVMs are linear classifiers that use the hinge loss. They enable the use of kernels to separate, in a high-dimensional feature space, data that are not necessarily linearly separable in their original space. Using the RBF kernel, the loss function to minimize is:

J(\alpha) = \frac{1}{m} \sum_{i=1}^{m} \max\left(0,\; 1 - y_i (K\alpha)_i\right) + \lambda \, \alpha^\top K \alpha

with the kernel matrix K being defined as K_{ij} = \exp(-\gamma \|x_i - x_j\|^2). The 2 model hyperparameters to optimize are therefore \lambda and \gamma.

SVMs are inherently two-class classifiers. In order to use them for a J-class problem, one can train either J one vs. all SVMs or J(J-1)/2 one vs. one SVMs. Since my classes were extremely unbalanced, I decided to use one vs. one SVMs, as this decreases the impact of having very large and very small classes. Furthermore, I weighted each example depending on its label: examples from overrepresented classes received a small weight whereas examples from underrepresented classes received a large one (weights: "1-2" = 0.5, "2-3" = 1, "3-7" = 1.2, "8-14" = 3.4, "10-14" = 0.7, ">14" = 5.7). As for softmax regression, the model was fitted using both the full and the filtered dataset. The best values for lambda and gamma were selected with 10-fold cross-validation on the training set, and the models with the best parameters were retrained on the entire training set before being evaluated on the test set.

Random forest

Random forest classifiers build a collection of ntree decision trees, each tree being grown on a bootstrap sample drawn with replacement from the training set. At each decision split of each tree, a random subset of mtry features is selected and the split is decided only over this subset: the best feature and decision threshold to split on are those that minimize the Gini index. The Gini index I_G measures how often an example would be mislabeled if it were labeled randomly according to the class distribution at the node:

I_G = \sum_{i=1}^{K} f_i (1 - f_i), \quad \text{with } f_i = p(y = i)

I trained random forest classifiers on both the full and the filtered training sets and optimized the number mtry of features selected at each split. I used the randomForest package in R to fit the random forest models.

Because the classes in my dataset were very unbalanced, accuracy was not a good measure of learning. Instead, I used definitions of precision and recall generalized to the multi-class setting to optimize the hyperparameters.
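One common multi-class generalization is to macro-average the per-class precision and recall; the report does not spell out which averaging it used, so the sketch below is an assumption.

```r
# Macro-averaged precision and recall from a confusion matrix (illustrative).
macro_precision_recall <- function(truth, pred) {
  lv <- sort(union(truth, pred))
  cm <- table(factor(truth, levels = lv),
              factor(pred,  levels = lv))   # rows = truth, columns = predictions
  tp <- diag(cm)
  prec <- tp / pmax(colSums(cm), 1)          # per class: TP / (TP + FP)
  rec  <- tp / pmax(rowSums(cm), 1)          # per class: TP / (TP + FN)
  c(precision = mean(prec), recall = mean(rec))
}
```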

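Since the report names the randomForest package, the fit described above would look roughly like the following; the data objects and the mtry value shown are placeholders, not the report's tuned code.

```r
# Random forest fit with the package named in the report (sketch).
library(randomForest)

rf <- randomForest(x = train_scaled, y = as.factor(train_labels),
                   ntree = 2500,  # number of trees used in the report
                   mtry  = 20)    # number of features subsampled at each split
rf_pred <- predict(rf, test_scaled)
```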
Results

Softmax regression

The softmax regression model was fitted using batch gradient descent, with batches of 100 examples for 100 iterations and a learning rate of 0.05. Better precision and recall were obtained with the filtered dataset, probably indicating overfitting of the model to unimportant features in the full dataset. The L1 regularization hyperparameter was optimized using 10-fold cross-validation (CV) on the training set. The mean precision and recall of the CV are presented in figure 2. The best performance was achieved with an L1 parameter of 0.6, and evaluation on the filtered test set showed a precision of 0.43 and a recall of 0.44.

[Fig. 2: Softmax classification - 10-fold CV of the L1 hyperparameter over the training set]

SVM

A first attempt at implementing one vs. all SVMs for this classification problem did not yield any useful result because of the imbalance of the classes. Therefore, a one vs. one SVM classification approach was employed. All model fitting was carried out using batch gradient descent (batches of 100 examples, annealed learning rate of 1/sqrt(t), for 40*m iterations). I first tested the influence of weights on the model using gamma = 1/(number of features) and cost = 1 (fig. 3). Using weights improved precision and recall by 6-10% on both the filtered and full datasets. The best cost value was 100 for the full dataset and 200 for the filtered dataset. Optimization of the RBF gamma yielded the following values: 0.035 for the filtered dataset and 0.0003 for the full dataset. Evaluation on the test set showed that the best models had a precision and recall of 0.48 and 0.48 for the full dataset, and of 0.52 and 0.46 for the filtered dataset.

[Fig. 3: Precision and recall during SVM hyperparameter optimization - A: influence of weights on the SVM, B: CV of the cost on the training set, C: CV of the RBF gamma on the training set (values in B and C are mean precision and recall during CV)]
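For comparison, an off-the-shelf weighted one vs. one RBF SVM could be fitted with the e1071 package (libsvm trains one vs. one classifiers internally, and class.weights rescales the cost per class); the report instead implements its own gradient-descent fit, so this is only an illustrative equivalent with placeholder data objects.

```r
# Weighted RBF SVM via e1071 (illustrative; not the report's implementation).
library(e1071)

wts <- c("1-2" = 0.5, "2-3" = 1, "3-7" = 1.2,
         "8-14" = 3.4, "10-14" = 0.7, ">14" = 5.7)
svm_fit <- svm(x = train_scaled, y = as.factor(train_labels),
               kernel = "radial", gamma = 0.035, cost = 200,
               class.weights = wts)
svm_pred <- predict(svm_fit, test_scaled)
```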

Random forest

I trained random forest models with 2500 trees and a variable number of features subsampled at each split. Optimization of the number of features for the split decision yielded different results for the full and filtered datasets: 20 features worked best for the filtered dataset, whereas 5 features were better for the full dataset. The best random forest model yielded a precision of 0.48 and a recall of 0.47 with the filtered dataset (fig. 4).

[Fig. 4: Random forest model - precision and recall during optimization of the number of subsampled features]

Model comparison and PCA

All three models performed better with the filtered dataset but did not have vastly different classification powers, as the comparison of confusion matrices in figure 5 shows. All classifiers were able to classify fast- and slow-growing organisms (class 0: 1-2 days, and class 4: 10-14 days) relatively well, but were not able to resolve medium growth rates (classes 2 and 3) or extremely slow organisms (class 5). A PCA of the dataset shows a clear chasm between slow- and fast-growing organisms (class 0 and class 4). However, fast-growing organisms overlap with medium-fast organisms on the plot, which might explain the difficulty the classifiers have in classifying them properly.

[Fig. 5: Confusion matrices from the best models (A); plot of the first two principal components of the PCA over the full dataset (B)]

Conclusion

This project focused on predicting microbial incubation times from genomic features. All the classifiers trained yielded precision and recall values in the 0.4-0.5 range, indicating that no model was vastly superior. The classifiers were able to classify fast- and slow-growing organisms well, but not the organisms whose incubation times are in between (or extremely long). This lack of predictive power can be explained by the imbalance of the classes, the small number of examples compared to the number of features or, more likely, a lack of knowledge of the optimal growth conditions for the organisms. Under optimal growth conditions, it is possible that the organisms in classes 2 and 3 would actually grow faster and would fall in class 0.

References:

1. Libbrecht MW, Noble WS. 2015. Machine learning applications in genetics and genomics. Nature Reviews Genetics 16(6):321-332.
2. Lauro FM, McDougald D, Thomas T, Williams TJ, Egan S, et al. 2009. The genomic basis of trophic strategy in marine bacteria. Proc. Natl. Acad. Sci. U.S.A. 106(37):15527-15533.
3. Roller BRK, Stoddard SF, Schmidt TM. 2016. Exploiting rRNA operon copy number to investigate bacterial reproductive strategies. Nature Microbiology 1:16160.
4. Söhngen C, Bunk B, Podstawka A, Gleim D. 2013. BacDive - the Bacterial Diversity Metadatabase. Nucleic Acids Research.
5. https://www.ncbi.nlm.nih.gov
6. Song Q, Ni J, Wang G. 2013. A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data. IEEE Transactions on Knowledge and Data Engineering 25(1):1-14.