Bioinformatics for Microbial Biology

1 Bioinformatics for Microbial Biology. Chaochun Wei (韦朝春), ccwei@sjtu.edu.cn, Fall

2 Outline. Part I: Visualization tools for microbial genomes (tool: GBrowse). Part II: Metagenomics: Who is in there? What are they doing? How do they compare? Some tools.

3 GBrowse: A visualization tool for microbial genomes

4 GBrowse (Generic Genome Browser): developed by GMOD (Generic Model Organism Database). A web application for exploring genomes; free software. Goal: simplify and standardize MODs (Model Organism Databases). Genome databases have similar requirements: view a DNA sequence and its associated features; allow zooming and scrolling.

5 GBrowse as a standard. Example: Human Genome Segmental Duplication Database

6 Banner; instructions; search; overview; region displayed; details; tracks; cookies

7 Open view and hidden view: use the (open) and (hide) toggles to control which sections you want to view.

8 Useful links: GBrowse; GBrowse mailing list; GFF3 specification; GFF3 validator; glyphs available in GBrowse

9 Part II: Metagenomics. 1. Reading materials. 2. Metagenomics: metagenomic binning (composition structure); function annotation; comparative metagenomics; metagenomic assembly; applications.

10 Reading materials: Venter, J. C., et al., Environmental Genome Shotgun Sequencing of the Sargasso Sea, Science (2004), 304, 66. Tringe, S., et al., Comparative Metagenomics of Microbial Communities, Science (2005), 308, 554. Wooley, J. C., Godzik, A., Friedberg, I., A primer on metagenomics, PLoS Comput Biol (2010), Feb 26;6(2):e Qin, J., et al., A human gut microbial gene catalogue established by metagenomic sequencing, Nature (2010), 464, 4.

11 Microbes: Microbes are ubiquitous and essential to all life. Bacteria, archaea, and microeukaryotes dominate Earth's habitats, compound recycling, and nutrient sequestration.

12 Study of microbial genomes: In the late 1970s, the genomes of bacteriophages MS2 and φX174 were sequenced. In 1995, the first bacterial genome, Haemophilus influenzae, was sequenced. 1,767 completed bacterial genomes had been deposited in the GenBank database at the time of writing. However, less than 1% of microbes can be cultured.

13 Metagenomics. Definition: Metagenomics is the study of metagenomes, the genetic material recovered directly from environmental samples.

14 Typical sources of metagenomes: soil samples; sea water samples; seabed samples; air samples; medical samples; ancient bones; human microbiome

15 A human carries more than the human genome. In a healthy adult, microbial cells are commonly estimated to outnumber human cells by about 10 to 1, and most of them live in the gut.

16 Three basic questions in metagenomics: Who is in there? (metagenome binning). What are they doing? (function annotation). How do they compare? (comparative metagenomics).

17 Q1: WHO IS IN THERE?

18 Two types of metagenomics methods: 1. Marker gene sequencing (16S rRNA, 18S rRNA). 2. Whole genome shotgun sequencing.

19 Variable regions of the 16S rRNA gene. "We find that different taxonomic assignment methods vary radically in their ability to recapture the taxonomic information in full-length 16S rRNA sequences: most methods are sensitive to the region of the 16S rRNA gene that is targeted for sequencing, but many combinations of methods and rRNA regions produce consistent and accurate results. To process large datasets of partial 16S rRNA sequences obtained from surveys of various microbial communities, including those from human body habitats, we recommend the use of Greengenes or RDP classifier with fragments of at least 250 bases, starting from one of the primers R357, R534, R798, F343 or F517." Source: Liu Z, DeSantis TZ, Andersen GL, Knight R. Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Res. 2008 Oct;36(18):e120. Epub 2008 Aug 22.

20 16S rRNA genes

21 Rarefaction curves. Red: a species-rich habitat of which only a small fraction has been sampled. Blue: a habitat that has not been exhaustively sampled. Green: a habitat in which most or all species have been sampled.
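To make the curve shapes concrete, here is a minimal sketch of how a rarefaction curve can be computed by repeated subsampling. The function name `rarefaction_curve`, the toy read labels, and the chosen depths are illustrative assumptions, not taken from the slides.

```python
import random

def rarefaction_curve(labels, depths, n_rep=20, seed=0):
    """Mean number of distinct taxa observed when `depth` reads are
    drawn without replacement, averaged over `n_rep` random draws."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        richness = [len(set(rng.sample(labels, depth))) for _ in range(n_rep)]
        curve.append(sum(richness) / n_rep)
    return curve

# Hypothetical sample: one taxon label per read.
reads = ["A"] * 50 + ["B"] * 30 + ["C"] * 15 + ["D"] * 5
depths = [1, 10, 50, 100]
curve = rarefaction_curve(reads, depths)
for depth, mean_taxa in zip(depths, curve):
    print(depth, mean_taxa)
```

A curve that flattens as depth grows corresponds to the green case; a curve still climbing at the full sampling depth corresponds to the red or blue cases.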

22 Limitations of 16S classification: the copy number of the 16S rRNA gene can vary by an order of magnitude between bacterial species; PCR introduces amplification biases.
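One common way to reduce the copy-number effect is to divide each taxon's read count by its 16S copy number before computing relative abundance. The sketch below assumes the copy numbers are already known; the taxon names, counts, and copy numbers are hypothetical.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Relative abundance after dividing 16S read counts by gene copy number."""
    corrected = {t: read_counts[t] / copy_numbers[t] for t in read_counts}
    total = sum(corrected.values())
    return {t: value / total for t, value in corrected.items()}

# Hypothetical data: taxon_A carries 7 copies of the 16S gene, taxon_B only 1.
read_counts = {"taxon_A": 70, "taxon_B": 10}
copy_numbers = {"taxon_A": 7, "taxon_B": 1}
abundance = copy_number_corrected_abundance(read_counts, copy_numbers)
print(abundance)
```

By raw read counts taxon_A looks seven times more abundant, but after correction the two taxa come out equal. This does not address PCR bias, which needs experimental controls.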

23 Environmental shotgun sequencing: sampling from habitat; filtering particles, typically by size; lysis and DNA extraction; cloning and library construction; sequencing the clones; computational analysis. Source: Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010 Feb 26;6(2):e Review.

24 Binning: alignment-based methods; non-alignment-based methods: k-mer frequency (codon usage), HMM/IMM (hidden Markov models / interpolated Markov models)
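As a minimal sketch of the k-mer frequency idea, the code below turns a sequence into a normalized k-mer frequency vector and compares two vectors by Euclidean distance. The function names and the choice of distance are assumptions for illustration; real binning tools use more elaborate models.

```python
from collections import Counter
from itertools import product

def kmer_profile(seq, k=4):
    """Normalized frequencies of all 4**k DNA k-mers, in lexicographic order.
    Windows containing characters other than A, C, G, T are ignored."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Toy sequences: fragments with similar composition give a small distance.
profile_a = kmer_profile("ACGTACGT", k=2)
profile_b = kmer_profile("ACGTACGA", k=2)
print(euclidean(profile_a, profile_b))
```

Fragments from the same genome tend to have similar profiles, so clustering these vectors groups reads or contigs without any alignment to a reference.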

25 Measuring diversity (species diversity). Alpha-diversity: biodiversity in a defined habitat or ecosystem (e.g., Shannon index). Beta-diversity: species diversity between habitats. Gamma-diversity: total biodiversity over a large region containing several ecosystems.

26 Shannon index: measuring alpha diversity
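The Shannon index is H' = -sum(p_i * ln p_i), where p_i is the proportion of individuals (or reads) belonging to species i. A minimal sketch follows; the function name and the example counts are illustrative.

```python
import math

def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over species with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

even_community = [10, 10, 10, 10]    # four equally abundant species
skewed_community = [37, 1, 1, 1]     # same richness, one dominant species
print(shannon_index(even_community))
print(shannon_index(skewed_community))
```

For a fixed number of species the index is largest when all species are equally abundant (ln 4 for four species) and drops as one species dominates.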

27 Measuring beta diversity
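The slide does not name a specific measure. As one widely used option, here is a sketch of the Bray-Curtis dissimilarity between two samples, assuming both are given as count vectors over the same ordered list of taxa. The function name and the counts are illustrative.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity: 1 - 2 * (sum of shared minima) / (total counts).
    0 means identical composition, 1 means no taxa in common."""
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    total = sum(counts_a) + sum(counts_b)
    return 1.0 - 2.0 * shared / total

# Hypothetical counts for three taxa in two habitats.
habitat_1 = [6, 7, 4]
habitat_2 = [10, 0, 6]
print(bray_curtis(habitat_1, habitat_2))
```

Computing this for every pair of samples gives a distance matrix that can be fed to clustering or ordination methods.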

28 Q2: WHAT ARE THEY DOING?

29 Gene calling. Homology search (BLAST): sensitivity is low, specificity is high. Ab initio gene prediction: Markov models.

30 The information contained in different lengths of genomic DNA.

31 Gene prediction methods used in metagenomic projects

32 Gene prediction on unassembled single reads and assembled contigs

33 BLAST; MEGAN; MG-RAST. Or: skip the gene calling step.

34 Q3: HOW DO THEY COMPARE?

35 Comparative Metagenomics Taxonomic and functional content.

36

37 METAGENOMIC ASSEMBLY

38 Metagenomic assembly Hamiltonian path. Eulerian path.
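The two terms refer to two graph formulations of assembly: in an overlap graph reads are nodes and the assembler looks for a Hamiltonian path, while in a de Bruijn graph k-mers are edges and the assembler looks for an Eulerian path. The sketch below illustrates the Eulerian case on a toy example using Hierholzer's algorithm. It assumes error-free reads and a graph that has an Eulerian path; the function names and reads are illustrative, and real assemblers handle errors, repeats, and uneven coverage.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Each distinct k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    return [(m[:-1], m[1:]) for m in sorted(kmers)]

def eulerian_path(edges):
    """Hierholzer's algorithm: a path that uses every edge exactly once."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for u, v in edges:
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Spell the sequence: first node plus the last base of each following node."""
    return path[0] + "".join(node[-1] for node in path[1:])

# Two overlapping toy reads from a hypothetical 10-base sequence.
reads = ["ATGGCG", "GCGTGCA"]
contig = spell(eulerian_path(de_bruijn_edges(reads, 3)))
print(contig)
```

When a k-mer context repeats, more than one Eulerian path exists, so the spelled contig is one valid reconstruction rather than a unique answer. This ambiguity is a small-scale version of the repeat problem in real assemblies.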

39 Sequencing coverage
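Mean coverage is usually computed as (number of reads x read length) / genome size. Under the Lander-Waterman (Poisson) approximation, the expected fraction of the genome covered at least once is 1 - e^(-c) for coverage c. A minimal sketch with illustrative function names and numbers:

```python
import math

def mean_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_size

def expected_fraction_covered(coverage):
    """Poisson approximation: probability that a base is covered at least once."""
    return 1.0 - math.exp(-coverage)

# Hypothetical run: one million 100-base reads from a 5 Mb genome.
c = mean_coverage(1_000_000, 100, 5_000_000)
print(c, expected_fraction_covered(c))
```

In a metagenome each organism's coverage also scales with its relative abundance, so rare members of the community can remain at low coverage even when the total sequencing depth is large.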

40 Assembly software for NGS data: ABySS; ALLPATHS; SOAPdenovo; VCAKE; Velvet

41 124 samples Gb of sequences SOAPdenovo

42 Assembly result. Assembling each sample: 42.7% of reads were assembled into 6.58M contigs (N50 length: 2.2 kb). Assembling the pooled sample: 0.4M additional contigs were generated (N50 length: 939 bp).
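N50 is the contig length L such that contigs of length L or longer contain at least half of the total assembled bases. A minimal sketch, with an illustrative function name and toy contig lengths:

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the longest
    contig downward, first reaches half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")

# Toy assembly with a total length of 32.
contig_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(contig_lengths))
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, so it is usually reported together with the fraction of reads assembled, as on this slide.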

43 APPLICATIONS

44 Unknown pathogen identification

45 Metagenomics and Environmental Virology

46 Viromes and gut microbiomes

47 Tools for metagenomics: MEGAN; MG-RAST; JGI IMG; CAMERA

48 MEGAN

49 MEGAN: lowest common ancestor

50 MEGAN: modified k-NN LCA taxonomic classifier; requires a BLAST result file; extracts taxonomy and COGs from matches; features from the NCBI Prokaryotic Attributes Table
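The lowest common ancestor idea can be sketched as follows: a read is assigned to the deepest taxon shared by the lineages of all its significant BLAST hits. This is a simplified illustration, not MEGAN's actual implementation. The toy taxonomy, the `parent` mapping, and the function names are assumptions, and score thresholds are left out.

```python
def lineage(taxon, parent):
    """Path from the root down to `taxon`, following child -> parent links."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]  # root first

def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all taxa hit by one read."""
    lineages = [lineage(t, parent) for t in taxa]
    lca = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca

# Hypothetical, heavily abbreviated taxonomy (child -> parent).
parent = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Bacillus subtilis": "Bacillus",
    "Bacillus": "Bacteria",
}
hits = ["Escherichia coli", "Salmonella enterica"]
print(lowest_common_ancestor(hits, parent))
```

A read matching only one species is assigned to that species, while a read from a widely conserved gene that matches many distant taxa is pushed up to a high-level node.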

51 Assignment of Reads to Taxa

52 Comparison and Analysis of Multiple Datasets

53 COG analysis

54 RDP

55 RDP

56 RDP tools for 16S pyrosequencing

57 RDP classifier

58 GREENGENES

59 Greengenes: The purpose of the Greengenes project is to computationally align all publicly available 16S sequences and place them phylogenetically. A scriptable, scalable alignment method, NAST, was conceived, producing a global 16S rRNA gene multiple sequence alignment. Over 9,000 Operational Taxonomic Units (OTUs) have been identified solely by sequence identity clustering. Details of the algorithms used were described in the journal Bioinformatics.

60 MG-RAST

61 The SEED & RAST. Subsystems: pathway database; expert annotation; curated simultaneously across many genomes. FIGfams: database of protein families; derived from the Subsystems database; controlled addition of new family members. RAST: genome annotation system; uses FIGfams for gene annotation; uses Subsystems for pathway annotation.

62 Workflow

63 Comparing the compositions of different metagenomes

64 Acknowledgement: Some slides were provided by Peng Jia, CAS.

65 SCBIT computer room