Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH BIOL 7210 A Computational Genomics 2/18/2015

The $1,000 genome is here! http://www.illumina.com/systems/hiseq-x-sequencing-system.ilmn

Bioinformatics bottleneck

Bioinformatics challenges
Methods: How do I analyze my data using procedures for various data types?
Infrastructure: Where do I process my data? Large-scale compute accessibility; installing and maintaining software.
Standards: How do I ensure my results are useful? Common, shared formats using community-developed software and tools.

High throughput sequencing map http://omicsmaps.com/

The case for cloud computing in genome informatics http://genomebiology.com/2010/11/5/207

The National Center for Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
Establish public databases
Research in computational biology
Develop software tools for sequence analysis
Disseminate biomedical information

The NCBI microbial annotation pipeline
1. Ab initio prediction of coding sequences: GeneMark and Glimmer (standalone tools) http://www.ncbi.nlm.nih.gov/genomes/microbes/microbial_taxtree.html
2. Automated annotation: NCBI Prokaryotic Genome Automatic Annotation Pipeline (RPS-BLAST, BLASTX, TBLASTN) http://www.ncbi.nlm.nih.gov/genomes/static/pipeline.html

The NCBI microbial annotation pipeline http://www.ncbi.nlm.nih.gov/genome/annotation_prok/process/

Other genomic resources Protein Clusters

Genome Annotation Checks (complete genomes)
Why do we need to perform checks? Garbage in, garbage out.
We want to provide a tool that checks the annotation of a genome for anomalies that need to be examined further, as a measure of genome annotation quality.
It functions in conjunction with existing tools built into Sequin and checks made by GenBank staff during the submission process.

Genome Annotation Checks (complete genomes)
Takes input genomic file (ASN.1 format)
Nucleotide sequence extracted
tRNAscan used to search for missing tRNAs
BLAST search against all RefSeq proteins from complete genomes (E < 10^-6)
RPS-BLAST against all Conserved Domain profiles (E < 10^-2)
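
The two E-value cutoffs on this slide can be applied as a simple post-filter on search output. The following is a minimal sketch, assuming the hits are available as tab-separated lines in the common 12-column BLAST tabular layout with the E-value in column 11. The function name and the sample rows are hypothetical and are not part of the NCBI tool.

```python
# Hypothetical post-filter; assumes 12-column tabular BLAST output:
# query, subject, %identity, length, mismatches, gap opens,
# qstart, qend, sstart, send, evalue, bitscore.
BLAST_CUTOFF = 1e-6  # RefSeq protein search threshold from the slide
RPS_CUTOFF = 1e-2    # Conserved Domain search threshold from the slide


def filter_hits(lines, cutoff):
    """Return (query, subject, evalue) for hits with E-value below cutoff."""
    kept = []
    for line in lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        evalue = float(fields[10])
        if evalue < cutoff:
            kept.append((fields[0], fields[1], evalue))
    return kept


# Made-up rows, for illustration only.
sample = [
    "gene1\tsubjA\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t540",
    "gene2\tsubjB\t35.0\t80\t52\t0\t1\t80\t10\t89\t0.5\t30.1",
]
strong = filter_hits(sample, BLAST_CUTOFF)
```

Here only the first made-up row passes the stricter protein cutoff; the same function can be reused with `RPS_CUTOFF` for domain hits.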

Genome Annotation Checks (complete genomes)
Current submission
1. Potential frameshifts
2. RNA-CDS overlaps
3. CDS-CDS overlap
4. RNA-RNA overlap
5. Missing tRNAs (complete)
6. Missing rRNAs (5S, 16S, 23S)
7. Truncated proteins (partial domain overlaps)
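
Check 6 amounts to comparing the annotated rRNA features against the expected set. A minimal sketch, assuming each annotated feature carries a free-text product name such as "16S ribosomal RNA"; the function name and that naming convention are assumptions made for illustration, not the pipeline's actual logic.

```python
REQUIRED_RRNAS = {"5S", "16S", "23S"}  # the three rRNAs named on the slide


def missing_rrnas(annotated_products):
    """Return the required rRNA types that have no annotated feature."""
    found = set()
    for product in annotated_products:
        words = product.split()
        for rrna in REQUIRED_RRNAS:
            # Match the type as a whole word, e.g. "16S ribosomal RNA".
            if rrna in words:
                found.add(rrna)
    return sorted(REQUIRED_RRNAS - found)


# Made-up annotation, for illustration only.
report = missing_rrnas(["16S ribosomal RNA", "23S ribosomal RNA"])
```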

1. Potential frameshifts
Two or more adjacent genes encoding proteins that hit the same subject in BLAST results
[Figure: adjacent genes Protein1 and Protein2, drawn 5' to 3', with a common BLAST hit spanning both proteins; Protein3 and Protein4 also shown]
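
The rule on this slide can be sketched as a scan over genes in genome order. This is a simplified illustration, assuming each gene has already been reduced to its best BLAST subject (or `None` when there is no hit); the function name and input layout are hypothetical, and a real check would presumably also confirm that the hits cover different parts of the subject.

```python
def potential_frameshifts(genes):
    """genes: list of (gene_id, best_hit_subject) in genome order.

    Returns runs of two or more adjacent genes sharing the same subject.
    """
    runs = []
    current = []
    for gene_id, subject in genes:
        if current and subject is not None and subject == current[-1][1]:
            # Same subject as the previous gene: extend the current run.
            current.append((gene_id, subject))
        else:
            # Subject changed: record the finished run if it is long enough.
            if len(current) >= 2:
                runs.append([g for g, _ in current])
            current = [(gene_id, subject)]
    if len(current) >= 2:
        runs.append([g for g, _ in current])
    return runs


# Made-up gene order, for illustration only.
flagged = potential_frameshifts(
    [("p1", "X"), ("p2", "X"), ("p3", "Y"), ("p4", None), ("p5", None)]
)
```

Genes without any hit are never grouped together, so the two trailing `None` entries are not reported.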

2. CDS-RNA Overlap
RNAs completely overlapping a CDS (on either strand), and vice versa
[Figure: 5' Protein1 3' drawn over 5' RNA2 3']

3. CDS-CDS Overlap
CDS completely overlapping another CDS (on either strand)
[Figure: 5' Protein1 3' drawn over 5' Protein2 3']
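
Both overlap checks reduce to interval containment. A minimal sketch, assuming features are given as 1-based inclusive coordinates and that strand is ignored, since the slides count overlaps on either strand; the function name and the coordinates are hypothetical.

```python
def completely_overlapping(features):
    """features: list of (name, start, end), 1-based inclusive.

    Returns (inner, outer) name pairs where inner lies entirely
    within outer, regardless of strand.
    """
    pairs = []
    for i, (name_a, start_a, end_a) in enumerate(features):
        for j, (name_b, start_b, end_b) in enumerate(features):
            if i != j and start_b <= start_a and end_a <= end_b:
                pairs.append((name_a, name_b))
    return pairs


# Made-up coordinates, for illustration only.
contained = completely_overlapping(
    [("Protein1", 100, 900), ("RNA2", 300, 400), ("Protein3", 850, 1200)]
)
```

The pairwise loop is quadratic in the number of features, which is fine for a sketch; sorting by start coordinate would be the natural next step for a whole genome.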

Use in RefSeq
Missing or absent structural ribosomal RNAs (5S, 16S, 23S) were detected in all complete prokaryotic genomes
Internal ribosomal RNA database is used for BLAST searches
High-scoring potential rRNA is aligned against the internal database
Analyzed for missing rRNAs, strand mismatches, and length mismatches
Currently added semiautomatically (automatically in the future)

Data Exchange
EcoCyc: publications, protein interactions
EcoGene: publications, gene locations, gene names, verified N-termini
PseudoCAP: publications
REBASE: publications, protein names
BRC: preliminary data
KEGG: pathways, ortholog groups

How has the genome changed?
More complex genome structures (chromosomes, organelles, plasmids)
Genome sequencing, NextGen sequencing
More complex genome assembly (chromosomes, scaffolds, contigs)
Genome-scale projects (transcriptome, exome, epigenomics, proteomics)
Multi-isolate genome sequencing (1001 Arabidopsis, 1000 human genomes)
Metagenomes
Now useful for drug development

New resources at NCBI

New genomic resources at NCBI

Why do we need new databases?
Taxonomy, BioSample, BioProject, Genome, Assembly, Nucleotide

BioProject, Genome, Assembly
BioProject is an administrative object (defined by goal, target, funding, collaboration).
Genome is a biological object defining an organism at the molecular level.
Genome assembly is a complex data structure that defines the structure, relative position (scaffold), and chromosome placement of DNA sequences originating from a single sample.

What is a Genome project?
A genome project is a scientific endeavor that ultimately aims to:
determine the complete genome sequence of an organism,
annotate protein-coding genes and other important genome-encoded features, and
understand the biology, physiology, and evolution of the organism.

Genome Project -> BioProject
Random survey, Metagenome, Targeted sequencing, Variant discovery, Population genomics, Genome sequencing, Ecosystem genomics, Epigenomics, Assembly, Transcriptome sequencing, Proteomics, Annotation

BioProject data model
[Figure: project attributes Target, Scope, Objective, Capture, Material, Method. Scope values: Mono-isolate, Multi-isolate, Multi-species, Environmental. Material values: DNA, RNA, Protein. Method values: sequencing, array, proteomics]

Why do we need a database of genome assemblies?
We are in a period of extraordinary growth in genomics data. To get the full benefit from all this data, users must be able to integrate data from different sources. Integration only works if users know whether the different data were reported in the same coordinate system.

TB H37Rv: Sanger vs. Broad
Broad assembly (NC_018143)
Sanger assembly (NC_000962)
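
The two H37Rv accessions illustrate why coordinates cannot be mixed blindly. A minimal sketch of a guard that groups datasets by the reference accession they report before any integration is attempted; the function names and dataset names are hypothetical, and only the two accessions come from the slide.

```python
from collections import defaultdict


def group_by_reference(datasets):
    """datasets: mapping of dataset name -> reference sequence accession.

    Returns a mapping of accession -> sorted list of dataset names.
    """
    groups = defaultdict(list)
    for name, accession in datasets.items():
        groups[accession].append(name)
    return {acc: sorted(names) for acc, names in groups.items()}


def same_coordinate_system(datasets):
    """True only when every dataset reports the same accession."""
    return len(set(datasets.values())) <= 1


# Hypothetical datasets annotated against the two H37Rv assemblies.
datasets = {
    "variant_calls": "NC_000962",
    "expression_peaks": "NC_018143",
    "gene_models": "NC_000962",
}
groups = group_by_reference(datasets)
compatible = same_coordinate_system(datasets)
```

In this made-up case the expression data would need to be remapped, or analyzed separately, before being combined with the other two datasets.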

Mycobacterium genomes at NCBI

Mycobacterium tuberculosis genomes

Mycobacterium tuberculosis overview

Mycobacterium tuberculosis genome annotation

Mycobacterium tuberculosis H37Rv

Mycobacterium tuberculosis H37Rv browser From the Gene record

Mycobacterium tuberculosis H37Rv GenePlot

BioProject, BioSample, Genome, Assembly, Nucleotide
[Figure: four example projects linking the databases]
BioProject (single isolate): Genome, one BioSample, one Assembly, Nucleotide
BioProject (single isolate): Genome, one BioSample, one Assembly, Nucleotide
BioProject (single isolate): Genome, one BioSample, three Assemblies, Nucleotide
BioProject (multi isolate): Genome, four BioSamples, four Assemblies, Nucleotide, SRA
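
The relationships in this figure can be written down as nested records. This is only an illustrative sketch: the class and field names and all identifiers are hypothetical and do not reflect NCBI's actual schema. It assumes a project owns samples and each sample owns one or more assemblies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assembly:
    name: str
    nucleotide_accessions: List[str] = field(default_factory=list)


@dataclass
class BioSample:
    sample_id: str
    assemblies: List[Assembly] = field(default_factory=list)


@dataclass
class BioProject:
    project_id: str
    scope: str  # e.g. "single isolate" or "multi isolate"
    samples: List[BioSample] = field(default_factory=list)

    def assembly_count(self) -> int:
        """Total number of assemblies across all samples."""
        return sum(len(s.assemblies) for s in self.samples)


# Placeholder identifiers, for illustration only.
multi = BioProject(
    project_id="project-1",
    scope="multi isolate",
    samples=[
        BioSample(f"sample-{i}", [Assembly(f"assembly-{i}")])
        for i in range(1, 5)
    ],
)
single = BioProject(
    project_id="project-2",
    scope="single isolate",
    samples=[
        BioSample("sample-A", [Assembly("v1"), Assembly("v2"), Assembly("v3")])
    ],
)
```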

NCBI genome submission dataflow
[Figure: metadata flows through the Common Submission Interface into BioProject and BioSample; sequence data flows into SRA and GenBank (contigs); Genome Collection]

Virtual machines in cloud environments Running the pipeline happens on the local machine, while the heavy lifting is done on the cloud/cluster

CloVR is a Virtual Machine
Virtual machine pipelines: CloVR-Search, CloVR-Microbe, CloVR-16S, CloVR-Metagenomics
Angiuoli, et al. (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics

CloVR Architecture

Galaxy on the cloud
Get Galaxy without the data or usage limitations.
Combine with Cloud BioLinux to have access to many tools.
Create an analysis cluster in minutes.
Use autoscaling to get good performance at low cost.
http://wiki.g2.bx.psu.edu/admin/cloud

Deploying a Galaxy cluster on AWS
[Figure: four numbered steps; images not reproduced]
