Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH BIOL 7210 A Computational Genomics 2/18/2015
The $1,000 genome is here! http://www.illumina.com/systems/hiseq-x-sequencing-system.ilmn
Bioinformatics bottleneck
Bioinformatics challenges. Methods: How do I analyze my data using procedures for various data types? Infrastructure: Where do I process my data? (Large-scale compute accessibility; installing and maintaining software.) Standards: How do I ensure my results are useful? (Common, shared formats using community-developed software and tools.)
High throughput sequencing map http://omicsmaps.com/
The case for cloud computing in genome informatics http://genomebiology.com/2010/11/5/207
The National Center for Biotechnology Information, Bethesda, MD. Created in 1988 as part of the National Library of Medicine at NIH. Establish public databases; research in computational biology; develop software tools for sequence analysis; disseminate biomedical information.
The NCBI microbial annotation pipeline 1. Ab initio prediction of coding sequences: GeneMark and Glimmer Standalone Tools http://www.ncbi.nlm.nih.gov/genomes/microbes/microbial_taxtree.html 2. Automated annotation: NCBI Prokaryotic Genome Automatic Annotation Pipeline RPS-BLAST, BLASTX, TBLASTN http://www.ncbi.nlm.nih.gov/genomes/static/pipeline.html
The NCBI microbial annotation pipeline http://www.ncbi.nlm.nih.gov/genome/annotation_prok/process/
Other genomic resources Protein Clusters
Genome Annotation Checks (complete genomes). Why do we need to perform checks? Garbage in, garbage out. We want to provide a tool that checks the annotation of a genome for anomalies that need to be examined further, giving a measure of genome annotation quality. It functions in conjunction with existing tools built into Sequin and with the checks made by GenBank staff during the submission process.
Genome Annotation Checks (complete genomes). Takes an input genomic file (ASN.1 format). The nucleotide sequence is extracted. tRNAscan-SE is used to search for missing tRNAs. BLAST search against all RefSeq proteins from complete genomes (E < 10^-6). RPS-BLAST against all Conserved Domain profiles (E < 10^-2).
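The two E-value cutoffs above can be applied as a simple filter over search hits. The sketch below is only an illustration: it assumes the hits are available as BLAST+ tabular output (`-outfmt 6`, default 12 columns, E-value in column 11), which the slides do not state, and the function name and the example identifiers are hypothetical.

```python
import csv
import io

# Cutoffs taken from the slide; everything else here is illustrative.
BLAST_MAX_EVALUE = 1e-6     # BLAST vs. RefSeq proteins from complete genomes
RPSBLAST_MAX_EVALUE = 1e-2  # RPS-BLAST vs. Conserved Domain profiles


def filter_hits(tabular_text, max_evalue):
    """Return (query, subject, evalue) for hits with E-value below max_evalue.

    Assumes BLAST+ tabular output (-outfmt 6) with the default 12 columns,
    where the 11th column holds the E-value.
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        evalue = float(row[10])
        if evalue < max_evalue:
            hits.append((row[0], row[1], evalue))
    return hits


# Hypothetical example rows (made-up identifiers and scores).
EXAMPLE = (
    "prot1\tsubjA\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t520\n"
    "prot2\tsubjB\t30.0\t80\t56\t2\t1\t80\t5\t84\t0.5\t30.1\n"
)
```

Calling `filter_hits(EXAMPLE, BLAST_MAX_EVALUE)` keeps only the first row, since the second hit has an E-value of 0.5.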
Genome Annotation Checks (complete genomes). Current submission: 1. Potential frameshifts 2. RNA-CDS overlaps 3. CDS-CDS overlaps 4. RNA-RNA overlaps 5. Missing tRNAs (complete) 6. Missing rRNAs (5S, 16S, 23S) 7. Truncated proteins (partial domain overlaps)
1. Potential frameshifts: two or more adjacent genes encoding proteins that hit the same subject in the BLAST results. [Diagram: 5' Protein1, Protein2 3', with a common BLAST hit spanning both proteins; Protein3, Protein4]
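The frameshift rule above (adjacent genes whose proteins hit the same BLAST subject) can be sketched as a single pass over genes in genomic order. This is a minimal illustration, not the NCBI implementation: the `Gene` record and the function name are hypothetical, and requiring the adjacent genes to lie on the same strand is an added assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gene:
    name: str
    strand: str              # "+" or "-"
    best_hit: Optional[str]  # subject ID of the best BLAST hit, or None


def potential_frameshifts(genes: List[Gene]) -> List[List[str]]:
    """Return runs of two or more adjacent genes sharing one BLAST subject.

    `genes` must already be sorted by genomic position.
    Assumption (not from the slides): genes in a run share a strand.
    """
    groups = []
    current: List[Gene] = []
    for gene in genes:
        extends_run = (
            bool(current)
            and gene.best_hit is not None
            and gene.best_hit == current[-1].best_hit
            and gene.strand == current[-1].strand
        )
        if extends_run:
            current.append(gene)
        else:
            if len(current) >= 2:
                groups.append([g.name for g in current])
            current = [gene]
    if len(current) >= 2:
        groups.append([g.name for g in current])
    return groups
```

Each returned group is a candidate for manual review: a single gene split by a frameshift is one possible explanation, but a genuine pair of paralogous or fused-in-other-species genes would look the same.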
2. CDS-RNA overlap: RNAs completely overlapping (+/- strand) a CDS, and vice versa. [Diagram: 5' Protein1 3'; 5' RNA2 3']
3. CDS-CDS overlap: a CDS completely overlapping (+/- strand) another CDS. [Diagram: 5' Protein1 3'; 5' Protein2 3']
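The complete-overlap checks (CDS-RNA and CDS-CDS) both reduce to interval containment. The sketch below assumes features are given as 1-based inclusive coordinates on a single sequence; the `Feature` record and function names are hypothetical. Strand is recorded but ignored, on the reading that "+/- strand" means an overlap counts on either strand.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Feature:
    name: str
    kind: str    # "CDS" or "RNA"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"; not used by the check


def contains(outer: Feature, inner: Feature) -> bool:
    """True if `inner` lies completely within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def complete_overlaps(features):
    """Return (outer_name, inner_name, label) for each completely nested pair.

    The label is "CDS-CDS", "CDS-RNA", or "RNA-RNA".
    """
    found = []
    for a, b in combinations(features, 2):
        for outer, inner in ((a, b), (b, a)):
            if contains(outer, inner):
                label = "-".join(sorted((outer.kind, inner.kind)))
                found.append((outer.name, inner.name, label))
                break  # report each pair once, even if intervals are identical
    return found
```

The pairwise loop is quadratic in the number of features; for a full genome a sort-and-sweep over start coordinates would be the usual refinement.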
Use in RefSeq. Missing or absent structural ribosomal RNAs (5S, 16S, 23S) were detected in all complete prokaryotic genomes. An internal ribosomal RNA database is used for BLAST searches. High-scoring potential rRNA is aligned against the internal database. Analyzed for missing rRNAs, strand mismatches, and length mismatches. Currently added semiautomatically (automatically in the future).
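The "missing rRNA" part of this check amounts to comparing the annotated rRNA types against the three expected structural rRNAs. A minimal sketch, assuming the annotation has already been reduced to a list of rRNA type labels (the function name is hypothetical, and the strand and length comparisons are not covered here):

```python
# The three structural rRNAs named on the slide.
REQUIRED_RRNAS = ("5S", "16S", "23S")


def missing_rrnas(annotated_types):
    """Return the required rRNA types that are absent from the annotation.

    `annotated_types` is an iterable of labels such as "16S"; matching is
    case-insensitive and ignores surrounding whitespace.
    """
    present = {label.strip().upper() for label in annotated_types}
    return [rrna for rrna in REQUIRED_RRNAS if rrna not in present]
```

For example, `missing_rrnas(["16S", "23s"])` reports that the 5S rRNA is missing.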
Data Exchange. EcoCyc: publications, protein interactions. EcoGene: publications, gene locations, gene names, verified N-termini. PseudoCAP: publications. REBASE: publications, protein names. BRC: preliminary data. KEGG: pathways, ortholog groups.
How has the genome changed? More complex genome structures (chromosomes, organelles, plasmids). Genome sequencing: NextGen sequencing. More complex genome assembly (chromosomes, scaffolds, contigs). Genome-scale projects (transcriptome, exome, epigenomics, proteomics). Multi-isolate genome sequencing (1001 Arabidopsis, 1000 human genomes). Metagenomes. Now useful for drug development.
New resources at NCBI
New genomic resources at NCBI
Why do we need new databases? Taxonomy, BioSample, BioProject, Genome, Assembly, Nucleotide
BioProject, Genome, Assembly. BioProject is an administrative object (defined by goal, target, funding, collaboration). Genome is a biological object defining an organism at the molecular level. Genome assembly is a complex data structure that defines the structure, relative position (scaffold), and chromosome placement of DNA sequences originating from a single sample.
What is a genome project? A genome project is a scientific endeavor that ultimately aims to determine the complete genome sequence of an organism, to annotate protein-coding genes and other important genome-encoded features, and to understand the biology, physiology, and evolution of the organism.
Genome Project -> BioProject: Random survey, Metagenome, Targeted sequencing, Variant discovery, Population genomics, Genome sequencing, Ecosystem genomics, Epigenomics, Assembly, Transcriptome sequencing, Proteomics, Annotation
BioProject data model. [Diagram labels: Target, Scope, Objective, Capture, Material, Method. Scope: Mono-isolate, Multi-isolate, Multi-species, Environmental. Material: DNA, RNA, Protein. Method: sequencing, array, proteomics.]
Why do we need a database of genome assemblies? We are in a period of extraordinary growth in genomics data. To get the full benefit from all this data, it is important that users can integrate data from different sources. Integration only works if users know whether the different data were reported in the same coordinate system.
TB H37Rv: Sanger vs. Broad. Broad assembly (NC_018143); Sanger assembly (NC_000962)
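The coordinate-system point can be made concrete with the two H37Rv accessions above: positions reported against NC_000962 and NC_018143 should not be combined without first checking which assembly each dataset uses. A minimal sketch of that check follows; the function names and the dataset names in the usage note are hypothetical.

```python
from collections import defaultdict


def group_by_accession(datasets):
    """Group dataset names by the sequence accession their coordinates use.

    `datasets` is an iterable of (dataset_name, accession) pairs.
    """
    groups = defaultdict(list)
    for name, accession in datasets:
        groups[accession].append(name)
    return dict(groups)


def can_integrate(datasets):
    """True only if all datasets are reported on one and the same accession."""
    return len(group_by_accession(datasets)) == 1
```

With `[("gene annotation", "NC_000962"), ("variant calls", "NC_018143")]` as input, `can_integrate` returns `False`, and `group_by_accession` shows which dataset would need to be remapped before the two can be combined.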
Mycobacterium genomes at NCBI
Mycobacterium tuberculosis genomes
Mycobacterium tuberculosis overview
Mycobacterium tuberculosis genome annotation
Mycobacterium tuberculosis H37Rv
Mycobacterium tuberculosis H37Rv browser From the Gene record
Mycobacterium tuberculosis H37Rv GenePlot
BioProject, BioSample, Genome, Assembly, Nucleotide. [Diagram: single-isolate BioProjects, each linking a Genome and one BioSample to one Assembly (or, in one case, three Assemblies) and Nucleotide records; a multi-isolate BioProject linking a Genome to four BioSamples and four Assemblies, with Nucleotide and SRA records.]
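The relationships in this diagram can be sketched as nested records: a BioProject holds BioSamples, and each BioSample holds the Assemblies derived from it. This is an illustrative model only, not NCBI's schema; all class names, field names, and identifiers are hypothetical, and deriving the scope from the number of samples is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assembly:
    accession: str
    nucleotide_records: List[str] = field(default_factory=list)


@dataclass
class BioSample:
    sample_id: str
    assemblies: List[Assembly] = field(default_factory=list)


@dataclass
class BioProject:
    project_id: str
    organism: str
    biosamples: List[BioSample] = field(default_factory=list)

    def scope(self) -> str:
        """Assumption: one sample means mono-isolate, several mean multi-isolate."""
        if not self.biosamples:
            raise ValueError("a project needs at least one BioSample")
        return "Mono-isolate" if len(self.biosamples) == 1 else "Multi-isolate"

    def all_assemblies(self) -> List[Assembly]:
        """Flatten the assemblies of every sample in the project."""
        return [a for sample in self.biosamples for a in sample.assemblies]
```

A single-isolate project with three assemblies is then one `BioSample` carrying three `Assembly` objects, while a multi-isolate project carries one `BioSample` per isolate.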
NCBI genome submission dataflow. [Diagram labels: Common Submission Interface; metadata: BioProject, BioSample; sequence data: SRA, GenBank; Contigs; Genome Collection]
Virtual machines in cloud environments. The pipeline is launched from the local machine, while the heavy lifting is done on the cloud or cluster.
CloVR is a virtual machine. Pipelines: CloVR-Search, CloVR-Microbe, CloVR-16S, CloVR-Metagenomics. Angiuoli et al. (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics
CloVR Architecture
Galaxy on the cloud Get Galaxy without the data or usage limitations. Combine with Cloud BioLinux to have access to MANY tools. Create an analysis cluster in minutes. Use autoscaling to get good performance at low cost. http://wiki.g2.bx.psu.edu/admin/cloud
Deploying a Galaxy cluster on AWS