Qiong Wang. Xander Gene Targeted Metagenomic Assembler. June 4th, 2015

1 Xander Gene Targeted Metagenomic Assembler
Qiong Wang
June 4th, 2015
Center for Microbial Ecology
Dept. of Plant, Soil and Microbial Sciences
Michigan State University

2 Genome Assembly
Genome assembly
  Repeats are a major problem
  Short reads with different error profiles
Metagenomic bulk assembly
  Assumes uniform abundance, which does not hold in metagenomic data
  Metagenomes are highly diverse
  Big data: space and time complexity mean low-abundance reads must be discarded before assembly

3 Profile Hidden Markov Models
Widely used in many fields, e.g. voice recognition
Protein and nucleic acid HMMs
  Powerful gene search and assignment tool
  Built from multiple sequence alignments
Probabilistic models on a linear system: the state changes according to a transition rule that depends only on the current state, independent of any other states
A profile HMM has three state types (match, insert, delete), 7 transitions between states, and transition and emission probabilities
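A minimal sketch of how a profile HMM scores one path through its states. All probabilities here are invented for illustration, and the names TRANSITIONS, MATCH_EMISSIONS, INSERT_EMISSION, and path_log_prob are hypothetical. The transition table lists the 7 allowed transitions between the three state types; the score of a path is the sum of log transition and log emission probabilities, each transition depending only on the previous state.

```python
import math

# Toy two-position profile HMM. Every number below is made up.
# The 7 allowed transitions between match (M), insert (I), delete (D) states.
TRANSITIONS = {
    "M": {"M": 0.90, "I": 0.05, "D": 0.05},
    "I": {"M": 0.70, "I": 0.30},
    "D": {"M": 0.80, "D": 0.20},
}

# Emission probabilities of the match states, one dict per model position.
MATCH_EMISSIONS = [
    {"L": 0.7, "I": 0.2, "V": 0.1},
    {"K": 0.6, "R": 0.4},
]

# Assumption: insert states emit from a flat background over 20 amino acids.
INSERT_EMISSION = 1.0 / 20


def path_log_prob(path):
    """Log probability of a state path.

    path is a list of (state, model_position, residue) tuples;
    residue is None for delete states, which emit nothing.
    """
    logp = 0.0
    prev = None
    for state, pos, residue in path:
        if prev is not None:
            # Markov property: only the previous state matters.
            logp += math.log(TRANSITIONS[prev][state])
        if state == "M":
            logp += math.log(MATCH_EMISSIONS[pos][residue])
        elif state == "I":
            logp += math.log(INSERT_EMISSION)
        prev = state
    return logp


# Two match states emitting L then K.
score = path_log_prob([("M", 0, "L"), ("M", 1, "K")])
```

A real profile HMM is estimated from a multiple sequence alignment and has position-specific transition probabilities; the single shared table here only keeps the sketch short.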

4 Protein Profile Hidden Markov Model
Add insert states for extra residues
[Figure: profile HMM diagram with example residues I, L, R, K, V; legend: insert state, match state, delete state]

5 Xander: Gene-Targeted Assembler Combining de Bruijn Graph and HMM
[Figure: a de Bruijn graph (kmers gagccg, ccggga, ccgagc) and a profile hidden Markov model (match, insert, and delete states at model positions 56 to 58) are combined into the Xander weighted assembly graph, whose nodes pair an HMM state with a kmer, e.g. M 57 with gag ccg, and M 58 with ccg gga or ccg agc]
Wang et al., 2015, Xander: Employing a Novel Method for Efficient Gene-Targeted Metagenomic Assembly. In revision
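A minimal sketch of the de Bruijn graph half of this combination, using plain nucleotide kmers. The function name build_de_bruijn and the two toy reads are hypothetical, and this is not Xander's implementation. Each kmer is a node, and an edge joins two kmers that follow each other in a read, so a kmer shared by two reads becomes a branch point.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each kmer to the set of kmers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Consecutive kmers overlap by k - 1 bases.
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + 1 + k])
    return graph


# Two toy reads that share the prefix gagccg and then diverge.
reads = ["gagccggga", "gagccgagc"]
graph = build_de_bruijn(reads, k=6)
successors = graph["gagccg"]  # the two branches out of the shared kmer
```

In the combined graph on the slide, each such kmer node is additionally paired with an HMM state, so the search can weight the two branches by how well they fit the gene model.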

6 HMM-Guided Graph Search

7 HMP Defined Community
http://rdp.cme.msu.edu

Organism Name                         Strain              Accession Number
Streptococcus mutans                  NN2025 DNA          NC_ (AP010655)
Listeria monocytogenes                L99 serovar 4a      NC_ (FM211688)
Acinetobacter baumannii               ATCC                NC_ (CP000521)
Actinomyces odontolyticus             ATCC                DS
Bacillus cereus                       ATCC                AE
Bacteroides vulgatus                  ATCC 8482           CP
Candida albicans*                     SC5314 Assembly 21  N/A
Clostridium beijerinckii              NCIMB 8052          CP
Deinococcus radiodurans               R1 chromosome 1     AE
Enterococcus faecalis                 OG1RF chromosome    CP
Escherichia coli                      K12                 NC_
Helicobacter pylori                                       NC_
Lactobacillus gasseri                 ATCC                NC_
Methanobrevibacter smithii*           ATCC                NC_
Neisseria meningitidis                MC58                NC_
Propionibacterium acnes               KPA                 NC_
Pseudomonas aeruginosa                PAO1                NC_
Rhodobacter sphaeroides               chromosome 1        NC_
Staphylococcus aureus subsp. aureus   USA300 TCH1516      NC_
Staphylococcus epidermidis            ATCC                NC_
Streptococcus agalactiae              2603V/R             NC_
Streptococcus pneumoniae              TIGR4               NC_

8 Xander Validation
Dataset: HMP defined community data (SRR172902, SRR172903), 1,037 Mbp of 75 bp Illumina reads
Conclusion: kmer length 45, prune 20, and count 1 work well
  Count: minimum number of occurrences a kmer needs to be included in the graph
  Prune: stop the search if the score has not improved within this number of vertices
Accuracy measurements:
  1. Number of errors found
  2. Number of chimeric contigs formed
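A minimal sketch of what the count and prune parameters control, written from the one-line definitions above rather than from Xander's source. The function names filter_kmers and vertices_visited, the toy reads, and the toy score sequence are all hypothetical.

```python
from collections import Counter


def filter_kmers(reads, k, min_count):
    """Count: keep only kmers that occur at least min_count times in the reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_count}


def vertices_visited(scores, prune):
    """Prune: stop once the best score has not improved for `prune` vertices.

    scores is the path score observed at each successive vertex;
    the return value is how many vertices were visited before stopping.
    """
    best = float("-inf")
    since_improved = 0
    visited = 0
    for score in scores:
        visited += 1
        if score > best:
            best = score
            since_improved = 0
        else:
            since_improved += 1
            if since_improved >= prune:
                break
    return visited


reads = ["ACGTA", "CGTAT"]
kept_all = filter_kmers(reads, k=3, min_count=1)     # every kmer is kept
kept_shared = filter_kmers(reads, k=3, min_count=2)  # only kmers seen twice
stopped_at = vertices_visited([1, 2, 3, 2, 2, 2, 5], prune=2)
```

With count 1 no kmer is discarded, which is what lets low-abundance genes stay in the graph; a small prune value ends a search early and would miss the late improvement in the toy score sequence.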

9 Comparison to SAT-Assembler
SAT-Assembler (Zhang Y et al., PLoS Comput. Biology, 2014)
Target gene: 50S ribosomal subunit protein L2 (rplb) (average length 825 bp)
Xander was run with the prune 20, kmer 45, and count 1 setting.
Xander recovered full or near full-length genes (94.6%) from 4 HMP defined community members. SAT recovered only 79.9% from 3 members. All contigs missed both ends.

Samples: HMP (14.5 M reads) and HMP & Corn (24.7 M reads); columns per sample: SAT, Xander
# contigs, # members recovered: 4 6 * 3 4 * 9
Median gene coverage (%): * 100
Max gene coverage (%): * 100
Median % nucleotide identity: * 90.3
Max % nucleotide identity: * 100
Time (min): 12 (a), 5 (a), *, 738 (b)

* SAT did not complete after 100 h.
(a) on iMac, 3.2 GHz Intel Core i5
(b) MSU HPCC network drive

10 Biofuel Crops and Nitrogen Cycling Genes
Crops: Corn, Switchgrass, Miscanthus
amoa: ammonia monooxygenase
nifh: nitrogen fixation
nirk / nirs: nitrite reductase
norb: nitric oxide reductase
nosz: nitrous oxide reductase
rplb: 50S ribosomal subunit protein L2

11 Rhizosphere Soil Data, Bulk Assembly, nirk
7 replicates from each crop from the KBS intensive site
One sample per lane of Illumina HiSeq; replicates were pooled before assembly
Assembled using the khmer protocol (provided by Jiarong Guo; Howe et al., PNAS)
[Table: for each sample (Corn, Miscanthus, Switchgrass): file size (GB), data size (Gbp), # protein contig clusters (99%), # OTUs at 95% aa identity, median length (aa), max length (aa), median % aa identity, max % aa identity, # reads covering kmers, gene abundance]

12 Rhizosphere Soil Data, Xander Assembly
[Table: for each gene (nirk, nifh, rplb) and crop (C = Corn, M = Miscanthus, S = Switchgrass): # chimeric clusters, # protein contig clusters, # OTUs at 95% aa identity, median length (aa), longest (aa), median % aa identity, max % aa identity, # reads covering kmers, gene abundance]

13 Rhizosphere Soil Data, Xander Assembly
[Table: for each gene (nirk, nifh, rplb) and crop (C = Corn, M = Miscanthus, S = Switchgrass): # chimeric clusters, # protein contig clusters, # OTUs at 95% aa identity, median length (aa), longest (aa), median % aa identity, max % aa identity, # reads covering kmers, gene abundance]
Use the rplb gene to normalize gene abundance
Read ratio: # reads covering kmers in gene contigs / # reads covering kmers in rplb contigs
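The read ratio is a single division. A minimal sketch follows, with a hypothetical function name read_ratio and made-up read counts standing in for the table values.

```python
def read_ratio(gene_reads, rplb_reads):
    """Gene abundance normalized by rplb.

    gene_reads: # reads covering kmers in the target gene's contigs
    rplb_reads: # reads covering kmers in the rplb contigs of the same sample
    """
    if rplb_reads <= 0:
        raise ValueError("rplb read count must be positive")
    return gene_reads / rplb_reads


# Made-up counts for one sample.
ratio = read_ratio(gene_reads=250, rplb_reads=1000)
```

Dividing by the rplb count from the same sample makes ratios comparable between samples that were sequenced to different depths.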

14 nirk Kmer Abundance
Kmer abundance of nitrite reductase gene (nirk) representative contigs assembled by Xander from the pooled rhizosphere samples. More than 35% of kmers of length 45 in the contigs occurred only once in the reads.
[Figure: fraction of kmers (log scale, 1x10^-5 to 1x10^0) versus kmer abundance for Corn, Miscanthus, and Switchgrass]

15 Mean Kmer Coverage
Mean kmer coverage of a contig: mean number of reads containing each kmer in a contig. Counts for kmers that occurred in multiple contigs were equally divided.
Representative contigs were chosen from clusters at 99% aa identity
[Figure: number of contigs (log scale) versus mean kmer coverage for nirk and rplb in Corn, Miscanthus, and Switchgrass]
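A minimal sketch of this definition, assuming the per-kmer read counts have already been computed. The names kmers_of and mean_kmer_coverage, and the toy contigs and counts, are hypothetical.

```python
from collections import Counter


def kmers_of(seq, k):
    """All kmers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mean_kmer_coverage(contigs, kmer_read_counts, k):
    """Mean kmer coverage per contig.

    contigs: {contig name: sequence}
    kmer_read_counts: {kmer: number of reads containing that kmer}
    A kmer found in several contigs has its count divided equally among them.
    """
    # How many contigs each kmer occurs in.
    shared_by = Counter()
    for seq in contigs.values():
        for kmer in set(kmers_of(seq, k)):
            shared_by[kmer] += 1

    coverage = {}
    for name, seq in contigs.items():
        kms = kmers_of(seq, k)
        total = sum(kmer_read_counts.get(km, 0) / shared_by[km] for km in kms)
        coverage[name] = total / len(kms) if kms else 0.0
    return coverage


# Toy example: the kmer CGT occurs in both contigs, so its 6 reads are split 3 and 3.
contigs = {"a": "ACGT", "b": "CGTA"}
counts = {"ACG": 4, "CGT": 6, "GTA": 2}
cov = mean_kmer_coverage(contigs, counts, k=3)
```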

16 Taxonomic Abundance rplb
[Figure: stacked bars of percent abundance (0% to 100%) for rplb Xander, rplb Shotgun, and 16S. Legend: Other, Gemmatimonadetes, Planctomycetes, Chloroflexi, Verrucomicrobia, Acidobacteria, Bacteroidetes, Firmicutes, Actinobacteria, Deltaproteobacteria, Gammaproteobacteria, Betaproteobacteria, Alphaproteobacteria]
Acidobacteria has few (<10) cultured rplb representatives

17 Taxonomic Abundance nirk
[Figure: stacked bars of percent abundance (0% to 100%) for Corn, Miscanthus, and Switchgrass. Legend: Fungi, Thermobaculum, Firmicutes, Spirochaetes, Bacteroidetes, Environmental, Chloroflexi, Verrucomicrobia, Deltaproteobacteria, Gammaproteobacteria, Betaproteobacteria, Alphaproteobacteria]
15% of the nirk contigs were closest match to rplb from Bradyrhizobium japonicum USDA 110
The other top matches were: Ralstonia pickettii 12J, Rhodanobacter fulvus Jip2

18 PCA Analysis using OTU abundance at 95% aa identity
[Figure: two PCA panels with points labeled C, M, S. nirk: PC1 8.54%, PC2 5.91%. rplb: PC1 6.58%, PC2 5.33%]

19 Multipath to find Sequence Heterogeneity
Xander can find multiple paths using Yen's k shortest path algorithm
1 starting kmer, 1000 paths, 37 unique contigs
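A minimal sketch of enumerating several lowest-cost paths from one start. This is a simpler best-first enumeration of loopless paths, not Yen's algorithm itself and not Xander's code; it returns paths in the same cheapest-first order when edge costs are nonnegative, but can be much slower on large graphs. The function name k_shortest_paths and the toy graph are hypothetical.

```python
import heapq


def k_shortest_paths(graph, source, target, k):
    """Return up to k loopless paths from source to target, cheapest first.

    graph: {node: {neighbor: nonnegative edge cost}}
    Each result is a (total cost, list of nodes) tuple.
    """
    heap = [(0.0, [source])]
    found = []
    while heap and len(found) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            found.append((cost, path))
            continue
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in path:  # keep paths loopless
                heapq.heappush(heap, (cost + weight, path + [neighbor]))
    return found


# Toy graph with two routes from s to t.
toy = {"s": {"a": 1, "b": 2}, "a": {"t": 1}, "b": {"t": 1}}
paths = k_shortest_paths(toy, "s", "t", k=2)
```

In the slide's setting the costs would come from the HMM-weighted assembly graph, and many of the 1000 paths collapse to the same sequence, which is consistent with only 37 unique contigs being reported.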

20 Xander Gene-targeted Assembly Processing Statistics
1 lane of Illumina HiSeq run in < 20 h
[Table: for samples Mock, K312, C1, and 7 Corns: data size (GB), build graph (GB), build graph time (h), find starting kmers (h)*, and search contigs* time for nifh, nirk, and rplb (reported in min, min, h, and h for the four samples)]
Processing time measured on the MSU HPCC network drive, single CPU
* can be multithreaded or run in parallel

21 Xander Gene Assembly Workflow
[Diagram: Xander Build and Modified HMMER3 feed Xander Search, followed by Quality Filtering, which yields Quality-filtered Genes]
Post-Assembly Analysis
  Read mapping, gene coverage
  Nearest neighbor assignments
  Taxonomic abundance

22 Xander Assembly Prep Steps
1. Build specialized forward and reverse HMMs
   Input: a small set of aligned seed sequences (using original HMMER3 and HMMs from FunGene)
   Output: forward and reverse HMMs for Xander, built with our modified HMMER3 (HMMER3-mod), which is tuned to detect close homologs
2. Identify starting kmers
   Input 1: a larger set of reference sequences (covering all possible diversity), aligned with the forward HMMs using HMMER3-mod
   Input 2: read files
   Output: starting nucleotide kmers, alignment positions, HMM states
   Multiple genes can be run together

23 Xander Assembly Steps
3. Build de Bruijn graph
   Input: read files
   Output: de Bruijn graph structure
4. Assemble one path in each direction from each starting kmer, then combine the two into one contig
   Input 1: forward and reverse HMMs
   Input 2: de Bruijn graph
   Input 3: starting kmers
   Output: nucleotide and protein contigs
5. Quality filter
   Length cutoff and HMM score cutoff
   Cluster at 99%, choose the longest contigs (RDP mcclust)
   Chimera removal (UCHIME)
   Output: quality-filtered contigs
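A minimal sketch of the first two parts of step 5, assuming cluster membership at 99% has already been computed (by RDP mcclust in the real pipeline) and leaving out chimera removal. The function name quality_filter, the record fields, and the cutoff values are hypothetical.

```python
def quality_filter(contigs, min_length, min_score):
    """Apply length and HMM score cutoffs, then keep the longest contig per cluster.

    contigs: list of dicts with keys 'id', 'seq', 'hmm_score', 'cluster'.
    The 'cluster' label is assumed to come from an earlier clustering step.
    """
    longest = {}
    for contig in contigs:
        if len(contig["seq"]) < min_length or contig["hmm_score"] < min_score:
            continue
        current = longest.get(contig["cluster"])
        if current is None or len(contig["seq"]) > len(current["seq"]):
            longest[contig["cluster"]] = contig
    return sorted(longest.values(), key=lambda c: c["id"])


# Made-up protein contigs and cutoffs.
toy_contigs = [
    {"id": "c1", "seq": "M" * 120, "hmm_score": 80.0, "cluster": 0},
    {"id": "c2", "seq": "M" * 150, "hmm_score": 95.0, "cluster": 0},
    {"id": "c3", "seq": "M" * 60, "hmm_score": 90.0, "cluster": 1},   # too short
    {"id": "c4", "seq": "M" * 130, "hmm_score": 20.0, "cluster": 2},  # low score
]
kept = quality_filter(toy_contigs, min_length=100, min_score=50.0)
```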

24 Xander Post-Assembly Analysis
6. Read mapping (RDP KmerFilter)
   Input: quality-filtered contigs
   Output: contig coverage, kmer abundance
7. Nearest neighbor assignment, taxonomic abundance
   Input: quality-filtered contigs
   Input: reference seqs
   Input: contig coverage
   Output: nearest matches
   Taxonomic abundance adjusted by coverage
8. Beta-diversity analysis (multiple samples)
   Input: quality-filtered aligned protein contigs
   Input: contig coverage
   Output: coverage-adjusted OTU abundance matrix
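A minimal sketch of the output of step 8, assuming each contig has already been assigned to an OTU and has a coverage value from step 6. The function name otu_abundance_matrix, the input layout, and the numbers are hypothetical.

```python
from collections import defaultdict


def otu_abundance_matrix(contig_otus, contig_coverage):
    """Sum contig coverage per OTU and per sample.

    contig_otus: {sample: {contig: OTU label}}
    contig_coverage: {sample: {contig: coverage}}
    Returns (OTU labels, sample names, matrix) with one matrix row per OTU.
    """
    samples = sorted(contig_otus)
    totals = defaultdict(lambda: defaultdict(float))
    for sample in samples:
        for contig, otu in contig_otus[sample].items():
            totals[otu][sample] += contig_coverage[sample].get(contig, 0.0)
    otus = sorted(totals)
    matrix = [[totals[otu][sample] for sample in samples] for otu in otus]
    return otus, samples, matrix


# Made-up assignments and coverages for two samples.
assignments = {
    "C": {"c1": "otu1", "c2": "otu1", "c3": "otu2"},
    "M": {"m1": "otu2"},
}
coverages = {
    "C": {"c1": 2.0, "c2": 3.0, "c3": 1.5},
    "M": {"m1": 4.0},
}
otus, samples, matrix = otu_abundance_matrix(assignments, coverages)
```

Weighting each OTU by summed coverage rather than by contig count is what makes the matrix coverage-adjusted.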

25 Xander User Efficient Setup
Xander GitHub repo: https://github.com/rdpstaff/xander_assembler
  Step-by-step instructions
  Preconfigured with the rplb gene and nitrogen cycling genes including nirk, nirs, nifh, nosz, norb and amoa
Prepare the HMMs; this step requires biological insight!
  Get reference sequences for the gene(s) (FunGene or literature search)
  Build specialized HMMs for Xander
Get metagenomic data
Run the Xander assembly
  Choose the right parameters for your dataset; see instructions

26 Summary
Compared with a recent targeted-gene assembler and a recent bulk assembly method, Xander assembled more gene contigs, which were longer and shared higher % aa identity with known references.
Detects low-abundance genes and low-abundance organisms.
Provides gene abundance and kmer abundance estimates.
HMMs can be tailored to the targeted genes, allowing flexibility to improve annotation over generic annotation pipelines.
Larger kmer size improves quality by reducing chimeras, but may result in shorter contigs.

27 Acknowledgements
James Cole, James Tiedje, Qiong Wang, Jordan Fish, Mariah Gilman, Yanni Sun, C. Titus Brown, Jiarong Guo