Bioinformatics for Microbial Biology
|
|
- Jasmine Rice
- 5 years ago
- Views:
Transcription
1 Bioinformatics for Microbial Biology Chaochun Wei ( 韦朝春 ) ccwei@sjtu.edu.cn Fall
2 Outline Part I: Visualization tools for microbial genomes Tools: Gbrowser Part II: Metagenomics Who is in there What are they doing How do they compare Some tools 2
3 Gbrowser: A visualization tool for microbial genomes 3
4 developed by GMOD Generic Model Organism Database GBrowse generic Genome Browser web application to explore genomes free software goal: simplify & standardize MODs Model Organism Databases Genome databases have similar requirements View DNA sequence and its associated features Allow zooming/scrolling
5 GBrowse as a Standard Human Genome Segmental Duplication Database
6 banner nstructions search overview region displayed details tracks ookies 6
7 Open View Hidden View (open) toggle (hide) control sections you want to view
8 Useful links Gbrowse Gbrowse mailing list GFF3 specification GFF3 validator Glyphs available in Gbrowse
9 Part II: Metagenomics 1. Reading materials 2. Metagenomics Metagenomic binning (composition structure) Function annotation Comparative metagenomics Metagenomic assembly Applications 9
10 Reading materials Craig, J., et al., Environmental Genome Shotgun Sequencing of the Sargasso Sea, Science (2004), 304, 66. Tringe, S. et al., Comparative Metagenomics of Microbial Communities, Science (2005) 308, 554. Wooley JC, Godzik A, Friedberg I. A primer on metagenomics, PLoS Comput Biol (2010), Feb 26;6(2):e Qin, J. et al., A human gut microbial gene catlogue established by metagenomic sequencing, Nature, (2010), 464, 4. 10
11 Microbes Microbes are ubiquitous, and they are essential to all life. Bacteria, archaea, and microeukaryotes dominate Earth s habitats, compound recycling and nutrient sequestration.
12 Study of microbial genomes In the late 1970s, the genomes of bacteriophages MS2 and ϕ-x174 were sequenced. In 1995, the first bacterial genome Haemophilus influenza was sequenced. 1,767 completed bacterial genomes are deposited in the Genbank database until However, less than1% microbes can be cultured.
13 Metagenomics Definition: Metagenomics is the study of metagenomes, the genetic materials recovered directly from environmental samples.
14 Soil samples Sea water samples Seabed samples Air samples Medical samples Ancient bones Human microbiome Typical Sources of Metagenomes
15 Human contains not only the human genome In a healthy adult Microbial cells are 10 times more than human cells Most of them are in gut 15
16 Three basic questions in Metagenomics Who is in there? Metagenome binning What are they doing? Function annotation How do they compare? Comparative metagenomics
17 Q1:WHO IS IN THERE?
18 Two types of metagenomics methods 1. Marker gene sequencing: 16S rrna, 18S rrna, 2. Whole genome shotgun sequencing
19 Variant regions of 16S rrna gene We find that different taxonomic assignment methods vary radically in their ability to recapture the taxonomic information in full-length 16S rrna sequences: most methods are sensitive to the region of the 16S rrna gene that is targeted for sequencing, but many combinations of methods and rrna regions produce consistent and accurate results. To process large datasets of partial 16S rrna sequences obtained from surveys of various microbial communities, including those from human body habitats, we recommend the use of Greengenes or RDP classifier with fragments of at least 250 bases, starting from one of the primers R357, R534, R798, F343 or F517. Accurate taxonomy assignments from 16S rrna sequences produced by highly parallel pyrosequencers. Liu Z, DeSantis TZ, Andersen GL, Knight R. Nucleic Acids Res Oct;36(18):e120. Epub 2008 Aug 22.
20 16S rrna genes
21 Rarefaction curves. Red: species rich habitat, only a small fraction has been sampled Blue: this habitat has not been exhaustively sampled Green: most or all species have been sampled
22 Limitations of 16S classification The copy number of 16S rrna gene can vary by an order of magnitude between bacterial species PCR-induced biases.
23 Environmental Shotgun Sequencing Sampling from habitat; Filtering particles, typically by size; DNA extraction and lysis; Cloning and library; Sequence the clones; Computational analysis A primer on metagenomics. Wooley JC, Godzik A, Friedberg I. PLoS Comput Biol Feb 26;6(2):e Review.
24 Binning Alignment based methods Non-alignment based methods K-mer frequency (Codon usage) HMM/IMM
25 Measuring diversity Species diversity alpha-diversity, Shannon index Biodiversity in a defined habitat or ecosystem beta-diversity Species diversity between habitats gamma-diversity total biodiversity over a large region containing several ecosystems
26 Shannon index Measuring alpha diversity
27 Measuring beta diversity
28 Q2:WHAT ARE THEY DOING?
29 Gene calling Homology search, BLAST Sensitivity is low Specificity is high Ab initio gene prediction, Markov models.
30 The information contained in different lengths of genomic DNA.
31 Gene prediction methods used in metagenomic projects
32 Gene prediction on unassembled single reads and assembled contigs
33 Blast Megan MG-RAST OR Skip the gene calling step.
34 Q3:How Do They Compare?
35 Comparative Metagenomics Taxonomic and functional content.
36
37 METAGENOMIC ASSEMBLY
38 Metagenomic assembly Hamiltonian path. Eulerian path.
39 Sequencing coverage
40 Assembly software for NGS data ABySS ALLPATHS SOAP de novo VCAKE Velvet
41 124 samples Gb of sequences SOAPdenovo
42 Assembly result Assemble each sample 42.7% reads were assembled. 6.58M contigs N50 length: 2.2k Assemble pool sample 0.4M additional contigs were generated.(n50 length: 939)
43 APPLICATIONS
44 Unknown pathogen identification
45 Metagenomics and Environmental Virology
46 Viromes and gut microbiomes
47 MEGAN MG-RAST JGI IMG CAMERA Tools for Metagenomics
48 MEGAN
49 MEGAN lowest common ancestor
50 MEGAN Modified k-nn LCA taxonomic classifier Requires BLAST result file Extracts taxonomy, COGs from matches Features from NCBI Prokaryotic Attributes Table
51 Assignment of Reads to Taxa
52 Comparison and Analysis of Multiple Datasets
53 COG analysis
54 RDP
55 RDP
56 RDP tools for 16S pyrosequencing
57 RDP classifier
58 GREENGENES
59 Greengenes The purpose of the greengenes project is to computationally align all publicly available 16S sequences and place them phylogenetically. A scriptable, scalable, alignment method was conceived, NAST, producing a global 16S rrna gene Multiple Sequence Alignment. Over 9,000 Operational Taxonomic Units (OTUs) have been identified solely by sequence identity clustering. Details of the algorithms that were used were described in the journal Bioinformatics.
60 MG-RAST
61 The SEED & RAST Subsystems: Pathway database Expert annotation Curated simultaneously across many genomes FIGfams: Database of protein families Derived from Subsystems database Controlled addition of new family members RAST: Genome annotation system Uses FIGfams for gene annotation Uses Subsystems for pathway annotation
62 Workflow
63 Comparing the compositions of different metagenomes
64 Acknowledgement Some slides are provided by Peng Jia, CAS 64
65 65 SCBIT 机房