Large Scale Enzyme Func1on Discovery: Sequence Similarity Networks for the Protein Universe

Similar documents
Function Prediction of Proteins from their Sequences with BAR 3.0

Radical SAM Superfamily Workshop!!! John A. Gerlt!!! Enzyme Function Initiative!! September 29, 2012! Why are we here?!

Fast genome-wide functional annotation through orthology assignment by eggnog-mapper

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

Tutorial. Whole Metagenome Functional Analysis (beta) Sample to Insight. November 21, 2017

Korilog. high-performance sequence similarity search tool & integration with KNIME platform. Patrick Durand, PhD, CEO. BIOINFORMATICS Solutions

Product Applications for the Sequence Analysis Collection

I AM NOT A METAGENOMIC EXPERT. I am merely the MESSENGER. Blaise T.F. Alako, PhD EBI Ambassador

Grundlagen der Bioinformatik Summer Lecturer: Prof. Daniel Huson

A Hidden Markov Model for Identification of Helix-Turn-Helix Motifs

ELE4120 Bioinformatics. Tutorial 5

Using Ensembles of Hidden Markov Models in metagenomic analysis

Applying Hidden Markov Model to Protein Sequence Alignment

European Genome phenome Archive at the European Bioinformatics Institute. Helen Parkinson Head of Molecular Archives

Grand Challenges in Computational Biology

Open Metrics. Open Metrics. Martin Fenner Technical Lead Article-Level Metrics Public Library of Science

The Gene Ontology Annotation (GOA) project application of GO in SWISS-PROT, TrEMBL and InterPro

Motif Discovery from Large Number of Sequences: a Case Study with Disease Resistance Genes in Arabidopsis thaliana

ELIXIR: data for molecular biology and points of entry for marine scientists

Introduction to Bioinformatics

Translating Biological Data Sets Into Linked Data

Data Mining for Biological Data Analysis

What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases.

Bioinformatics Practical Course. 80 Practical Hours

Textbook Reading Guidelines

BGGN 213: Foundations of Bioinformatics (Fall 2017)

Hidden Markov Models. Some applications in bioinformatics

VL Algorithmische BioInformatik (19710) WS2013/2014 Woche 3 - Mittwoch

Following text taken from Suresh Kumar. Bioinformatics Web - Comprehensive educational resource on Bioinformatics. 6th May.2005

This practical aims to walk you through the process of text searching DNA and protein databases for sequence entries.

Chapter 7. Motif finding (week 11) Chapter 8. Sequence binning (week 11)

Bioinformatics for Cell Biologists

Two Mark question and Answers

MARINE BIOINFORMATICS & NANOBIOTECHNOLOGY - PBBT305

Finding Compensatory Pathways in Yeast Genome

Biology 644: Bioinformatics

Post-assembly Data Analysis

Introduction to EMBL-EBI.

9/19/13. cdna libraries, EST clusters, gene prediction and functional annotation. Biosciences 741: Genomics Fall, 2013 Week 3

Bioinformatic tools for metagenomic data analysis

Machine Learning. HMM applications in computational biology

TIPP: Taxon Iden,fica,on using Phylogeny-Aware Profiles

Designing Filters for Fast Protein and RNA Annotation. Yanni Sun Dept. of Computer Science and Engineering Advisor: Jeremy Buhler

Q4: Report on a new computational method for improving the interpretation of microbial, metagenomic, or plant genomes.

Introduc)on to Databases and Resources Biological Databases and Resources

Sequence Based Function Annotation

Problem Set 4. I) I) Briefly describe the two major goals of this paper. (2 pts)

Introduction to DNA-Sequencing

MetaGO: Predicting Gene Ontology of non-homologous proteins through low-resolution protein structure prediction and protein-protein network mapping

CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes

Perspectives on the Priorities for Bioinformatics Education in the 21 st Century

BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES

An Overview of Probabilistic Methods for RNA Secondary Structure Analysis. David W Richardson CSE527 Project Presentation 12/15/2004

Classification of Protein Sequences using the Growing Self-Organizing Map

Gene Prediction Background & Strategy Faction 2 February 22, 2017

ESSENTIAL BIOINFORMATICS

Deakin Research Online

all official ACS meetings and events without the express written consent from the ACS.

Bioinformatics (Globex, Summer 2015) Lecture 1

Bioinformatics & Protein Structural Analysis. Bioinformatics & Protein Structural Analysis. Learning Objective. Proteomics

Exercise I, Sequence Analysis

Comparative Genomics. Page 1. REMINDER: BMI 214 Industry Night. We ve already done some comparative genomics. Loose Definition. Human vs.

Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

Sequence Alignment and Phylogenetic Tree Construction of Malarial Parasites

proteins PREDICTION REPORT Fast and accurate automatic structure prediction with HHpred

Textbook Reading Guidelines

Functional Annotation - Faction 2 Background and Strategy

Bioinformatics for Proteomics. Ann Loraine

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

BIOINFORMATICS IN AQUACULTURE. Aleksei Krasnov AKVAFORSK (Ås, Norway) Bergen, September 21, 2007

Hot Topics. What s New with BLAST?

Ensembl workshop. Thomas Randall, PhD bioinformatics.unc.edu. handouts, papers, datasets

Array-Ready Oligo Set for the Rat Genome Version 3.0

Glossary of Commonly used Annotation Terms

Genome sequence of Brucella abortus vaccine strain S19 compared to virulent strains yields candidate virulence genes

Knowledge-Guided Analysis with KnowEnG Lab

Introduction. CS482/682 Computational Techniques in Biological Sequence Analysis

GREG GIBSON SPENCER V. MUSE

Bioinformatics. Microarrays: designing chips, clustering methods. Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute

BIOINFORMATICS Introduction

UNIVERSITY OF KWAZULU-NATAL EXAMINATIONS: MAIN, SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics

Applied Bioinformatics

HPC in Bioinformatics and Genomics. Daniel Kahn, Clément Rezvoy and Frédéric Vivien Lyon 1 University & INRIA HELIX team LIP-ENS & INRIA GRAAL team

ECS 234: Genomic Data Integration ECS 234

BIMM 143: Introduction to Bioinformatics (Winter 2018)

Sequence Databases and database scanning

IGS Annota*on Engine and Manatee. Michelle Gwinn Giglio Pathway Tools Workshop October 2010

Representation in Supervised Machine Learning Application to Biological Problems

Resolution of fine scale ribosomal DNA variation in Saccharomyces yeast

03-511/711 Computational Genomics and Molecular Biology, Fall

Course Presentation. Ignacio Medina Presentation

Worksheet for Bioinformatics

Agenda. Web Databases for Drosophila. Gene annotation workflow. GEP Drosophila annotation projects 01/01/2018. Annotation adding labels to a sequence

ONLINE BIOINFORMATICS RESOURCES

Why learn sequence database searching? Searching Molecular Databases with BLAST

Sequence Analysis. II: Sequence Patterns and Matrices. George Bell, Ph.D. WIBR Bioinformatics and Research Computing

Biological databases an introduction

Understanding protein lists from comparative proteomics studies

Bioinformatics for Microbial Biology

Transcription:

Large Scale Enzyme Func1on Discovery: Sequence Similarity Networks for the Protein Universe Boris Sadkhin University of Illinois, Urbana-Champaign Blue Waters Symposium May 2015

Overview The Protein Sequence Database Problem Sequence Similarity Networks (SSNs) EFI-EST (Enzyme Similarity Tool) EST-Precompute

Personnel involved in this project Carl R. Woese Institute for Genomic Biology (IGB) at University of Illinois, Urbana-Champaign John A. Gerlt, PI Victor Jongeneel, CoPI Daniel Davidson David Slater External Collaborators Alex Bateman, EMBL-EBI Matthew Jacobson, UCSF

The Enzyme Function Initiative (EFI) The Enzyme Function Initiative, an NIH/NIGMS-supported Large-Scale Collaborative Project (EFI; U54GM093342; http://enzymefunction.org/) What do we do? collaborate create disseminate

An explosion of protein sequences! As of March 2015, 92,124,243 proteins had been identified.

The Problem The number of protein sequences is exploding! 50% of our protein databases are misannotated! There are many proteins and enzymes to discover!

The Solution A Sequence Similarity Network Database

Bridging the Gap : Biologists and Big Data

Generating the database on BW Biocluster @ IGB # of Nodes 20 EFI Nodes @24 cpu 20 Shared Nodes @24 cpu Blue Waters @ NCSA > 22,000 Nodes @ 32 cpu Storage (100TB) 600 TB for entire cluster 500 TB for just our project >90 million sequences 8 months < 2 weeks =4,243,438,028,099,403 comparisons Node hours? 200,000 node hours 6,400,000 cpu hours

What is a Sequence Similarity Network? node (circle) = protein sequence edge (line) = alignment score - log 10 [2 -bitscore (query length subject length)] Alignment Score

Using Sequence Similarity Networks

Using Sequence Similarity Networks

SSNS- Computationally Faster, Qualitatively Similar

Analyzing Groups of Proteins Multiple Sequence Alignment Phylogenetic Trees and Dendrograms Sequence Similarity Networks

Pros and Cons Multiple Sequence Alignment (MSA) Phylogenetic Trees Sequence Similarity Networks (SSNs) Visualization of Small Datasets Good Good Good Visualization of Large Datasets Bad Not so good Good Informative Small Datasets Large Datasets X Small Datasets Large Datasets X Small Datasets Large Datasets Computational Cost Expensive Requires Sensitive MSA Pairwise Sequence Alignment BLAST heuristics Displays Annotations? No Sometimes 26 (eg...crosslinks)

Our SSN Tools

efi.igb.illinois.edu/efi-est/

- Enzyme Similarity Tool Caveats: 100,000 sequence threshold for predefined families Takes time, networks need to be generated and regenerated for filtering

Gene3D PFAM Clans Interpro Families More? efi.igb.illinois.edu/est-precompute

Full SSNs each node = 1 sequence Representative SSNs each node > 1 sequence

EST & EST-Precompute use widely used database of conserved protein families that are based on a seed alignment of representative sequences that are used to generate a profile hidden Markov model (HMM) 14,831 defined families in Pfam http://pfam.xfam.org/

Challenges: The doubling time of the UniProt database (http://www.uniprot.org/), is ~ 18 months Adapting the workflow and algorithms for increasingly large sequence datasets Dealing with major changes in the databases from which we get our data

Our Workflow

Accomplishments Dealing with the explosion of protein sequences Algorithms Generated > 14,000 Pfams Production Pipeline

Blue Waters Team Contributions The Blue Waters Team has been helpful in dealing with our issues Live chat support Supplying job stats, optimizing our workflow, fixing software installations, you name it scheduler.x - the single threaded job scheduler

Thank You! Questions?

References Sequence Similarity Networks in the SFLD EFI EST Pfam http://www.sciencedirect.com/science/article/pii/ S1570963915001120 R.D. Finn, A. Bateman, J. Clements, P. Coggill, R.Y. Eberhardt, S.R. Eddy, A. Heger, K. Hetherington, L. Holm, J. Mistry, E.L. Sonnhammer, J. Tate, and M. Punta, Pfam: the protein families database. Nucleic Acids Res 2014, 42, D222-30. PMCID: PMC3965110 Uniprot C. UniProt UniProt: a hub for protein information Nucleic Acids Res, 43 (2015), pp. D204 D212 Collaborator Patsy Babbitt PMC http://www.ncbi.nlm.nih.gov/pmc/articles/ PMC2781113/ [4] http://www.ncbi.nlm.nih.gov/pmc/articles/ PMC1892569/ [5]