Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015


The $1,000 genome is here! http://www.illumina.com/systems/hiseq-x-sequencing-system.ilmn

Bioinformatics bottleneck

Bioinformatics challenges
- Methods: How do I analyze my data using procedures for various data types?
- Infrastructure: Where do I process my data? Large-scale compute accessibility; installing and maintaining software.
- Standards: How do I ensure my results are useful? Common, shared formats using community-developed software and tools.

High throughput sequencing map http://omicsmaps.com/

The case for cloud computing in genome informatics http://genomebiology.com/2010/11/5/207


The National Center for Biotechnology Information (NCBI), Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
- Establish public databases
- Research in computational biology
- Develop software tools for sequence analysis
- Disseminate biomedical information

The NCBI microbial annotation pipeline
1. Ab initio prediction of coding sequences: GeneMark and Glimmer standalone tools http://www.ncbi.nlm.nih.gov/genomes/microbes/microbial_taxtree.html
2. Automated annotation: NCBI Prokaryotic Genome Automatic Annotation Pipeline (RPS-BLAST, BLASTX, TBLASTN) http://www.ncbi.nlm.nih.gov/genomes/static/pipeline.html

The NCBI microbial annotation pipeline http://www.ncbi.nlm.nih.gov/genome/annotation_prok/process/

Other genomic resources Protein Clusters


Genome Annotation Checks (complete genomes)
Why do we need to perform checks? Garbage in, garbage out.
- We want to provide a tool that checks the annotation of a genome for anomalies that need to be examined further, giving a measure of genome annotation quality.
- It functions in conjunction with existing tools built into Sequin and with the checks made by GenBank staff during the submission process.

Genome Annotation Checks (complete genomes)
- Takes an input genomic file (ASN.1 format)
- Nucleotide sequence is extracted
- tRNAscan is used to search for missing tRNAs
- BLAST search against all RefSeq proteins from complete genomes (E < 10^-6)
- RPS-BLAST against all Conserved Domain profiles (E < 10^-2)
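The two E-value cutoffs above can be applied to search results with a few lines of Python. This is only a sketch: the slide does not say which output format the checker reads, so the example assumes BLAST tabular output (`-outfmt 6`, with the E-value in the 11th column), and the names `Hit`, `parse_tabular`, and `significant` are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float


# Cutoffs quoted on the slide.
BLAST_MAX_EVALUE = 1e-6      # BLAST vs. RefSeq proteins
RPS_BLAST_MAX_EVALUE = 1e-2  # RPS-BLAST vs. Conserved Domain profiles


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated hits (assumed -outfmt 6 column order)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        # Column 1 = query, column 2 = subject, column 11 = E-value.
        hits.append(Hit(cols[0], cols[1], float(cols[10])))
    return hits


def significant(hits: Iterable[Hit], max_evalue: float) -> List[Hit]:
    """Keep only hits with an E-value strictly below the cutoff."""
    return [h for h in hits if h.evalue < max_evalue]
```

The same `significant` function serves both searches; only the cutoff passed in changes.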

Genome Annotation Checks (complete genomes)
Current submission
1. Potential frameshifts
2. RNA-CDS overlaps
3. CDS-CDS overlap
4. RNA-RNA overlap
5. Missing tRNAs (complete)
6. Missing rRNA (5S, 16S, 23S)
7. Truncated proteins (partial domain overlaps)

1. Potential frameshifts
Two or more adjacent genes encoding proteins that hit the same subject in the BLAST results.
[Diagram: Protein1 and Protein2 are adjacent, 5' to 3', with a common BLAST hit spanning both proteins; Protein3 and Protein4 are also drawn.]
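A minimal sketch of this check, assuming the BLAST results have already been reduced to a mapping from each gene to the set of subjects it hits. The function name and the input layout are illustrative, not taken from the NCBI tool, and a real check would presumably also compare strands and the regions of the subject that each gene covers.

```python
from typing import Dict, List, Set, Tuple


def frameshift_candidates(
    gene_order: List[str],
    hits: Dict[str, Set[str]],
) -> List[Tuple[str, str, Set[str]]]:
    """Flag adjacent gene pairs that share at least one BLAST subject.

    gene_order: gene names in genomic order.
    hits: gene name -> set of subject identifiers it hit.
    """
    candidates = []
    # Walk over neighbouring genes only; non-adjacent genes are ignored.
    for left, right in zip(gene_order, gene_order[1:]):
        shared = hits.get(left, set()) & hits.get(right, set())
        if shared:
            candidates.append((left, right, shared))
    return candidates
```

With the four proteins named on the slide, a pair is reported only when the two neighbours share a subject, so Protein1 and Protein2 would be flagged while Protein3 and Protein4 would not.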

2. CDS-RNA Overlap
RNAs completely overlapping a CDS (on either strand), and vice versa.
[Diagram: 5' Protein1 3' and 5' RNA2 3']

3. CDS-CDS Overlap
A CDS completely overlapping another CDS (on either strand).
[Diagram: 5' Protein1 3' and 5' Protein2 3']
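The overlap checks reduce to interval containment. The sketch below assumes features are given as 1-based inclusive coordinates and ignores strand, since the slides count overlaps on either strand. `Feature` and `complete_overlaps` are hypothetical names, and the quadratic pairwise loop is for clarity rather than speed.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    name: str
    kind: str   # "CDS" or "RNA"
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def contains(outer: Feature, inner: Feature) -> bool:
    """True if inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def complete_overlaps(features: List[Feature]) -> List[Tuple[str, str, str]]:
    """Report pairs where one feature completely overlaps the other.

    Each result is (name_a, name_b, overlap_type), where overlap_type is
    "CDS-CDS", "CDS-RNA", or "RNA-RNA".
    """
    found = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if contains(a, b) or contains(b, a):
                overlap_type = "-".join(sorted((a.kind, b.kind)))
                found.append((a.name, b.name, overlap_type))
    return found
```

Partial overlaps are deliberately not reported, because the slides describe only complete overlaps.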

Use in RefSeq
- Missing or absent structural ribosomal RNAs (5S, 16S, 23S) were detected across all complete prokaryotic genomes
- An internal ribosomal RNA database is used for BLAST searches
- High-scoring potential rRNAs are aligned against the internal database
- Analyzed for missing rRNAs, strand mismatches, and length mismatches
- Currently added semi-automatically (automatically in the future)
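The "missing rRNA" part of this check can be pictured as a set difference between the three expected rRNA types and those annotated on a genome. This is only an illustration: the strand and length comparisons against the internal database are not modelled, and `missing_rrnas` is a made-up name.

```python
from typing import Iterable, Set

# The three structural rRNAs named on the slide.
EXPECTED_RRNAS = frozenset({"5S", "16S", "23S"})


def missing_rrnas(annotated: Iterable[str]) -> Set[str]:
    """Return the expected rRNA types that are absent from an annotation."""
    return set(EXPECTED_RRNAS) - set(annotated)
```

A genome annotated with several 16S and 23S copies but no 5S would return `{"5S"}`, which marks it as a candidate for the BLAST search against the rRNA database.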

Data Exchange
- EcoCyc: publications, protein interactions
- EcoGene: publications, gene locations, gene names, verified N-termini
- PseudoCAP: publications
- REBASE: publications, protein names
- BRC: preliminary data
- KEGG: pathways, ortholog groups

How has the genome changed?
- More complex genome structures (chromosomes, organelles, plasmids)
- Genome sequencing, NextGen sequencing
- More complex genome assembly (chromosomes, scaffolds, contigs)
- Genome-scale projects (transcriptome, exome, epigenomics, proteomics)
- Multi-isolate genome sequencing (1001 Arabidopsis, 1000 human genomes)
- Metagenomes
- Now useful for drug development

New resources at NCBI

New genomic resources at NCBI


Why do we need new databases?
Taxonomy, BioSample, BioProject, Genome, Assembly, Nucleotide

BioProject, Genome, Assembly
- BioProject is an administrative object (defined by goal, target, funding, collaboration)
- Genome is a biological object defining an organism at the molecular level
- Genome assembly is a complex data structure that defines the structure, relative position (scaffold), and chromosome placement of DNA sequences originating from a single sample
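One way to picture how these objects nest is with a few Python dataclasses. This is a teaching sketch, not NCBI's schema: the class and field names and the placeholder identifiers are all assumptions, and the Genome object is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assembly:
    """Sequences from a single sample, with their placement."""
    accession: str
    nucleotide_records: List[str] = field(default_factory=list)


@dataclass
class BioSample:
    """The biological material that was sequenced."""
    sample_id: str
    assemblies: List[Assembly] = field(default_factory=list)


@dataclass
class BioProject:
    """Administrative object: groups samples under one goal."""
    project_id: str
    goal: str
    samples: List[BioSample] = field(default_factory=list)

    def is_multi_isolate(self) -> bool:
        # More than one sample means more than one isolate.
        return len(self.samples) > 1


# Placeholder identifiers, not real accessions.
project = BioProject(
    project_id="PROJECT_1",
    goal="genome sequencing",
    samples=[BioSample("SAMPLE_1", [Assembly("ASSEMBLY_1", ["SEQ_1"])])],
)
```

Because every assembly hangs off exactly one sample here, the model reflects the statement that an assembly describes sequences originating from a single sample.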

What is a genome project?
A genome project is a scientific endeavor that ultimately aims to:
- determine the complete genome sequence of an organism,
- annotate protein-coding genes and other important genome-encoded features, and
- understand the biology, physiology, and evolution of the organism.

Genome Project -> BioProject
Project types: random survey, metagenome, targeted sequencing, variant discovery, population genomics, genome sequencing, ecosystem genomics, epigenomics, assembly, transcriptome sequencing, proteomics, annotation

BioProject data model
[Diagram labels: Target, Scope, Objective, Capture, Material, Method]
- Scope values: mono-isolate, multi-isolate, multi-species, environmental
- Material values: DNA, RNA, protein
- Method values: sequencing, array, proteomics

Why do we need a database of genome assemblies?
We are in a period of extraordinary growth in genomics data. To get the full benefit from all this data, it is important that users can integrate data from different sources. Integration only works if users know whether or not the different data were reported in the same coordinate system.

TB H37Rv: Sanger vs. Broad
- Broad assembly (NC_018143)
- Sanger assembly (NC_000962)
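The two H37Rv assemblies illustrate the coordinate-system problem: positions reported on NC_018143 cannot be compared directly with positions on NC_000962. A small sketch of a guard that refuses to compare features from different sequences follows. The feature coordinates and the `Annotation` type are invented for illustration; only the two accessions come from the slide.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    sequence_accession: str  # identifies the coordinate system
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    label: str


def same_coordinate_system(a: Annotation, b: Annotation) -> bool:
    """Two annotations are comparable only on the same sequence record."""
    return a.sequence_accession == b.sequence_accession


def overlaps(a: Annotation, b: Annotation) -> bool:
    """Interval overlap test that refuses to mix coordinate systems."""
    if not same_coordinate_system(a, b):
        raise ValueError(
            f"cannot compare {a.sequence_accession} with {b.sequence_accession}"
        )
    return a.start <= b.end and b.start <= a.end


# Hypothetical coordinates on the two H37Rv assemblies.
on_broad = Annotation("NC_018143", 1, 1524, "geneA")
on_sanger = Annotation("NC_000962", 1, 1524, "geneA")
```

Calling `overlaps(on_broad, on_sanger)` raises `ValueError` even though the numbers match, which is the point: identical coordinates on different assemblies do not necessarily refer to the same bases.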

Mycobacterium genomes at NCBI

Mycobacterium tuberculosis genomes

Mycobacterium tuberculosis overview

Mycobacterium tuberculosis genome annotation

Mycobacterium tuberculosis H37Rv

Mycobacterium tuberculosis H37Rv browser (from the Gene record)

Mycobacterium tuberculosis H37Rv GenePlot

BioProject, BioSample, Genome, Assembly, Nucleotide
[Diagram: three single-isolate BioProjects, each linking a Genome, a BioSample, one or more Assemblies, and Nucleotide records; and one multi-isolate BioProject linking a Genome to several BioSamples and Assemblies, plus Nucleotide and SRA records.]

NCBI genome submission dataflow
[Diagram labels: metadata, BioProject, Common Submission Interface, BioSample, sequence data, SRA, GenBank, contigs, Genome Collection]

Virtual machines in cloud environments
The pipeline is run from the local machine, while the heavy lifting is done on the cloud or cluster.

CloVR is a Virtual Machine
Virtual machine pipelines: CloVR-Search, CloVR-Microbe, CloVR-16S, CloVR-Metagenomics
Angiuoli, et al. (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics

CloVR Architecture

Galaxy on the cloud
- Get Galaxy without the data or usage limitations.
- Combine with Cloud BioLinux to have access to many tools.
- Create an analysis cluster in minutes.
- Use autoscaling to get good performance at low cost.
http://wiki.g2.bx.psu.edu/admin/cloud

Deploying a Galaxy cluster on AWS
[Steps 1 to 4 appear on the slide; their text was not captured.]