De novo genome assembly with next generation sequencing data
Jianbin Wang
HMGP 7620 (CPBS 7620 and BMGN 7620)
Genomics lectures
2/7/12

Outline
- The need for de novo genome assembly
- The nature of next generation sequencing data
- The concepts and methods
- The take home lessons

Why/When do we need de novo genome assembly?
- Lots of interesting organisms don't have their genome sequences available; these have to be assembled de novo from NGS data.
- Within a species, each individual has its own genome.
- Within one individual, different cells may carry genome alterations.

[Figure slides: New genomes; Within species; Within an individual]

The Nature of NGS Data
- Higher parallel operation/yield
- Much lower cost per base
- Shorter reads (unfortunately)
  - 454: 200-400 bp
  - Illumina: 50-150 bp
  - Sanger sequencing (for comparison): 600-1000 bp
  - ABI SOLiD: 35-75 bp
- Platform-based characteristic errors

Illumina paired-end vs. mate pair sequencing
[Figure: Paired-end; Mate pair]

De novo genome assembly concepts
Whole genome shotgun sequencing
[Figure: Genomic DNA -> genomic reads (paired-end, mate pair) -> de novo assembly -> contigs (Contig1 to Contig4) -> scaffold with gaps]

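The contig/scaffold/gap relationship in the figure above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular assembler: it assumes the contigs are already ordered and oriented, and that the gap sizes have been estimated elsewhere (for example from paired-end or mate pair insert sizes). The function name `build_scaffold` and the toy sequences are hypothetical; writing gaps as runs of N is the usual convention.

```python
def build_scaffold(contigs, gap_sizes):
    """Join ordered, oriented contigs into one scaffold string.

    contigs   -- list of contig sequences, already in scaffold order (assumption)
    gap_sizes -- estimated gap length between each adjacent pair of contigs
    Gaps are written as runs of 'N' (unknown bases).
    """
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # placeholder for the unsequenced gap
        parts.append(contig)
    return "".join(parts)
```

For example, `build_scaffold(["ACGT", "GGA", "TTC"], [3, 2])` joins three toy contigs with gaps of 3 and 2 unknown bases.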
Some vocabulary
- Coverage (C)
- Kmer coverage (C_k)
- N50, N90
[Figure examples: C = 4; C_k = 2 (k = 10); C_k = 3 (k = 5). Contig length plot (labels: Contig, Contig number): N50 = 18,063 bp, N50 number = 4,175; N90 = 3,548 bp, N90 number = 16,950]

Methods: Overlap-layout-consensus
- Pair-wise sequence alignments (computationally expensive)
- Construct an overlap graph to produce the read layout
- Perform multiple sequence alignments and generate the consensus
- Examples: Phrap, Celera, Arachne, CAP, PCAP, Newbler, ...
[Figure label: Illumina]

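The vocabulary terms (coverage, kmer coverage, N50/N90) can be made concrete with a short Python sketch. The definitions used here are the commonly used ones and are an assumption about what the slide intends: coverage C is total sequenced bases divided by genome size; kmer coverage is C_k = C * (L - k + 1) / L for read length L; and Nx is the contig length at which contigs of that length or longer contain at least x% of the assembly, with the "Nx number" being how many contigs that takes. The function names and the toy numbers are hypothetical.

```python
def coverage(num_reads, read_len, genome_size):
    """Average base coverage C = (number of reads * read length) / genome size."""
    return num_reads * read_len / genome_size


def kmer_coverage(c, read_len, k):
    """Kmer coverage C_k = C * (L - k + 1) / L; each read of length L has L - k + 1 kmers."""
    return c * (read_len - k + 1) / read_len


def nx(contig_lengths, x=50):
    """Return (Nx length, Nx number) for a list of contig lengths.

    Contigs are sorted longest first; Nx is the length of the contig at which
    the running total first reaches x% of the total assembly length, and the
    Nx number is how many contigs were needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    threshold = sum(lengths) * x / 100
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("no contigs given")
```

With toy contig lengths `[80, 70, 50, 40, 30, 20, 10]` (300 bp in total), `nx(..., 50)` stops at the second contig and `nx(..., 90)` at the fifth, which mirrors how the N50 and N90 values in the figure are read off.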
Methods: Eulerian path/de Bruijn graph
- Kmer hash table
- de Bruijn graph/Eulerian path search
- Examples: Euler, Velvet, ALLPATHS, ABySS, SOAPdenovo, ...
[Figure: the sequence AGATGATTCG split into 3-mers: AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG; label: Illumina]

Differences between an overlap graph and a de Bruijn graph
[Figure: Schatz et al. 2010]

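The kmer example on the slide (AGATGATTCG, k = 3) can be worked through in Python. The sketch below is a toy illustration, not how Velvet or any other assembler is implemented: it assumes a single error-free sequence, builds a de Bruijn graph whose nodes are (k-1)-mers and whose edges are the kmers, and finds an Eulerian path with Hierholzer's algorithm. All function names are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping kmers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(seq, k):
    """Graph with (k-1)-mer nodes; each kmer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    for kmer in kmers(seq, k):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge exactly once (assumes a path exists)."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_deg[node] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next(
        (n for n in graph if len(graph[n]) - in_deg.get(n, 0) == 1),
        next(iter(graph)),
    )
    remaining = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Spell the sequence along a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])
```

Calling `kmers("AGATGATTCG", 3)` gives the eight 3-mers listed on the slide, and `spell(eulerian_path(de_bruijn("AGATGATTCG", 3)))` spells the original sequence back. Note that GAT occurs twice, so the node GA has two edges to AT; repeats like this are what make real graphs ambiguous.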
Methods - challenges
- Repetitive sequence
- DNA polymorphisms/sequencing errors
- Non-uniform coverage (worse in Sanger sequencing)
- Computational complexity of processing large volumes of data

Reducing the complexity of the data
- Sub-assembly (grouped assembly)
- Repeat-masking
- Reference based

Additional Scaffolding
- Related genome as reference
- cDNAs/transcriptomes
- Conserved proteins
- Paired-end information
[Figure: Contig1 to Contig4 aligned to a reference genome, cDNA, or conserved protein and joined into a scaffold]

Genome assessment - coverage
- Reads coverage/reads used
- Physical coverage
- Functional coverage
  - cDNAs
  - Small RNAs

Genome assessment - continuity
- Consistency with available genetic maps
- Paired-end discrepancy
- mRNA/cDNA intactness

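One way to picture the "paired-end discrepancy" check from the continuity slide is the sketch below. It is an assumption about how such a check might look, not a description of a specific tool: read pairs are taken as already mapped back to the same contig or scaffold, each pair is reduced to the two mapped start positions, and a pair is flagged when the observed distance differs from the expected insert size by more than a chosen tolerance. The function name, the input format, and the numbers are hypothetical.

```python
def discordant_pairs(pairs, expected_insert, tolerance):
    """Return the read pairs whose mapped distance disagrees with the library.

    pairs           -- list of (left_pos, right_pos) mapped positions (assumed format)
    expected_insert -- expected distance between the two reads, in bp
    tolerance       -- allowed deviation from expected_insert, in bp
    """
    flagged = []
    for left, right in pairs:
        observed = abs(right - left)
        if abs(observed - expected_insert) > tolerance:
            flagged.append((left, right))
    return flagged
```

A region where many pairs are flagged, for example `discordant_pairs([(100, 400), (150, 2150)], 300, 50)` flagging the second pair, would be a candidate misjoin or collapsed repeat worth inspecting.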
De novo genome assembly on NGS data
- is feasible
- is still a very hard problem
- the algorithm matters, but the source of DNA and the quality of the library matter more
- a reference genome or other higher-order genetic map is of great value
- put it into the biological context

References/Additional reading
- Schatz, M. C., A. L. Delcher, et al. (2010). "Assembly of large genomes using second-generation sequencing." Genome Research 20(9): 1165-1173.
- Earl, D., K. Bradnam, et al. (2011). "Assemblathon 1: a competitive assessment of de novo short read assembly methods." Genome Research 21(12): 2224-2241.
- Salzberg, S. L., A. M. Phillippy, et al. (2012). "GAGE: A critical evaluation of genome assemblies and assembly algorithms." Genome Research.
- Treangen, T. J. and S. L. Salzberg (2012). "Repetitive DNA and next-generation sequencing: computational challenges and solutions." Nature Reviews Genetics 13(1): 36-46.