Genome Assembly. Background and Approach 28 Jan Jillian Walker Diana Williams


1 Genome Assembly Background and Approach 28 Jan 2015 Jillian Walker Diana Williams Ke Qi Xin Wu Bhanu Gandham Anuj Gupta Taylor Griswold Yuanbo Wang Sung Im Maxine Harlemon Nicholas Kovacs

2 Objectives
- Evaluate new or infrequently used assemblers
  - State-of-the-art knowledge for the class and CDC
- Combine results from assemblers into a superassembly
  - Best, highest-fidelity consensus sequence
- Create a wrapper to streamline the process
  - Efficient task replication

3 Outline
1. Biology and genome sequencing
2. Assembly pipeline
   a. Preprocessing
   b. Sequencing
   c. De novo assembly & contig integration
   d. BLAST
   e. Reference assembly
   f. Contig integration & comparison to de novo

4 mRNA and You
Image credit: Sisi Chen

5 mRNA and You
Gilbert, Scott. Developmental Biology.

6 Prepare DNA libraries
- Fragment DNA with sticky ends
- Repair ends and add adapters

7 Process
- Sequencing: extensive explanation provided by Illumina
  - Video (or this video)
  - PowerPoint

8 Single-end vs. paired-end reads
Jachob, J

9 Single-end vs. paired-end reads
- Single-end reads
  - Simple sequencing approach
- Paired-end reads
  - Improved coverage over SE reads (and better resolution of the 3' end)
  - Improved matching and contig assembly
  - Better prediction of anomalies: insertions, mutations, deletions, etc.
- Mate-paired reads
  - Combination of short and long insert sizes
  - Better coverage of repetitive regions and large structural rearrangements
Source: YouTube

10 What kind of data do we have?
Source        Type               Count
GAII          Single-end reads       9
HiSeq         Paired-end reads       4
HiSeq/MiSeq   Paired-end reads      84
Total reads: 97
- Neisseria meningitidis (Nm): serogroupable and nonserogroupable
- Haemophilus influenzae (Hi): serotypable and non-serotypable
- Haemophilus haemolyticus (Hhae)

11 Outline
1. Biology and genome sequencing
2. Assembly pipeline
   a. Preprocessing
   b. Sequencing
   c. De novo assembly & contig integration
   d. BLAST
   e. Reference assembly
   f. Contig integration & comparison to de novo

12 Pipeline
- Preprocessing: FastQC, PRINSEQ
- Assembly with and without error correction: REPTILE
- De novo assembly: SPAdes, ALLPATHS-LG*, ABySS, Ray*, Minia*
- QUAST will be used to evaluate the results of each
- CISA will be used to merge the results of the four assembly programs to make a super assembly
- QUAST will be used again to evaluate the results of the super assembly
- We will use BLAST to find a reference
- Re-assemble against the reference using BWA
* Not using due to various issues, discussed later. Try more assemblers.
[Flowchart labels: RAW READS; FastQC read metrics; Prinseq read trimming; QUALITY READS; SPAdes de novo assembly; ABySS de novo assembly; two further de novo assemblies marked "?"; QUAST assembly metrics; CISA contig integrator; QUAST assembly metrics]

13 Preprocessing
FastQC: trusted metrics program
- Input: BAM, SAM, or FASTQ (any variant)
- Output: zip file containing an HTML report
- How does it work? FastQC runs a variety of modules to generate metrics data:
  - Basic statistics, per-base sequence quality, per-sequence quality scores, etc.
  - Per-sequence GC content: can help identify contamination or sequencing problems
  - Per-base N content: shows the proportion of invalid base calls
  - Modules to identify overrepresented sequences and k-mers
- FastQC (and other metrics programs) can help us decide whether we need to trim our data, and how
- Resources: an excellent guide to FastQC, a manual, and a page of example reports
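Two of the metrics named on this slide can be sketched in a few lines of Python. This is only an illustration of what the numbers mean, not FastQC's implementation. The function names are made up for this example, and excluding N from the GC denominator is an assumption.

```python
def gc_content(seq):
    """Fraction of G/C among the called (A/C/G/T) bases of one read.

    Assumption: N bases are excluded from the denominator.
    """
    called = [b for b in seq.upper() if b in "ACGT"]
    if not called:
        return 0.0
    return sum(b in "GC" for b in called) / len(called)


def per_base_n_fraction(reads):
    """For each read position, the fraction of reads with an N there.

    Reads shorter than a given position are not counted at that position.
    """
    max_len = max((len(r) for r in reads), default=0)
    fractions = []
    for i in range(max_len):
        covering = [r for r in reads if len(r) > i]
        n_count = sum(r[i].upper() == "N" for r in covering)
        fractions.append(n_count / len(covering))
    return fractions


reads = ["ACGT", "GGNC", "ATNN"]
gc_per_read = [gc_content(r) for r in reads]
n_per_position = per_base_n_fraction(reads)
```

A read set whose per-read GC values cluster far from the expected value for the organism, or whose N fraction rises sharply toward the end of the reads, is the kind of pattern these modules are meant to surface.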

14 Preprocessing
PRINSEQ: summary and quality stats; allows filtering, reformatting, and trimming of sequence data
- Input: FASTA + QUAL (quality metrics); alternatively, FASTQ
- Output: FASTA, FASTA+QUAL, FASTQ, FASTQ+input FASTA, or FASTQ+FASTA+QUAL
- How does it work? Some functional overlap with FastQC (GC content, base quality), BUT many novel features too:
  - Identification of poly-A tails and sequence duplications
  - Mean sequence complexity
  - Uses PCA to group metagenomes and help identify sequence contamination
- Manual

15 Preprocessing
What is removed from the raw sequence data?
- Primers and adapters that were added during library construction
- Noise from low-quality reads
  - The FASTQ file includes Phred scores, a metric of confidence that the sequencing information is good
  - Sequences with a Phred score below 28 are removed
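The Phred cutoff can be illustrated with a short sketch. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q = 28 is roughly a 0.16% chance that a base call is wrong. Two things below are assumptions, since the slide does not state them: the quality string uses Phred+33 encoding, and the cutoff is applied to the mean score of the read. The function names are illustrative and are not taken from any tool.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in quality_line]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def passes_filter(quality_line, min_mean_q=28, offset=33):
    """Keep a read if its mean Phred score is at least min_mean_q.

    Assumption: the cutoff applies to the mean score of the read.
    """
    scores = phred_scores(quality_line, offset)
    return bool(scores) and sum(scores) / len(scores) >= min_mean_q


keep_high = passes_filter("IIII")   # every base has Q = 40
keep_low = passes_filter("5555")    # every base has Q = 20
```

A per-base or sliding-window cutoff would be an equally plausible reading of the slide; only the comparison inside `passes_filter` would change.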

16 De Novo Assembly
Ekblom and Wolf 2014

17 De Novo Assembly
- Sequencing reads are assembled into contigs
- Contigs are arranged and gaps filled to form scaffolds
help/scaffolds.html
Baker 2012

18 Assembly Algorithms
- Overlap-Layout-Consensus (OLC)
  - Represents sequencing reads and their overlaps
  - Pairwise sequence alignments
- De Bruijn (k-mer)
  - Nodes are all possible fixed-length strings
  - Edges are overlaps between substrings
  - More sensitive to repeats and sequencing errors than OLC
- Greedy
  - Graph-based
  - Overlap pairs of contigs; best overlap wins
  - Extend by comparing overlaps of remaining contigs; best overlap wins
  - Can have false-positive overlaps
Miller et al. 2010
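The greedy strategy is the easiest of the three to sketch. The toy code below repeatedly merges the pair of sequences with the longest exact suffix-prefix overlap until no overlap of at least `min_len` remains. It is a teaching sketch under simplifying assumptions (exact matches only, no reverse complements, no handling of reads contained inside other reads) and does not reproduce any particular assembler.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair again and again; return the remaining contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


contigs = greedy_assemble(["ATGGCG", "GCGTGC", "TGCAAT"])
```

Because the best overlap always wins, a repeat longer than the true overlap can pull two unrelated reads together, which is the false-positive risk noted on the slide.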

19 de Bruijn Graph Assembly
- Leaves out unresolvable repeats
- Breaks up sequence segments into smaller overlapping k-mers
- Smaller k-mers are nodes
- Edges connect nodes
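A minimal sketch of the construction, assuming the common convention in which each k-mer becomes an edge from its first k-1 bases to its last k-1 bases, so the nodes are the (k-1)-mers. The slide does not fix this convention, and the function names are illustrative.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the list of (k-1)-mer nodes it points to.

    A k-mer seen more than once contributes one edge per occurrence.
    """
    graph = defaultdict(list)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


graph = de_bruijn(["ATGGC"], 3)
```

For the read ATGGC with k = 3, the k-mers ATG, TGG, and GGC produce the chain AT -> TG -> GG -> GC.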

20 de Bruijn Graph Assembly
- With perfect sequencing, the graph is Eulerian:
  - At most two semi-balanced nodes (incoming and outgoing edge counts differ by 1)
  - Remaining nodes are balanced (incoming = outgoing)
- Gaps in coverage can lead to disconnected graphs
- There can be many walks that are Eulerian
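The degree condition and the walk itself can be sketched as follows. The walk uses Hierholzer's algorithm, which is a standard choice and not something the slide names. The degree check alone does not detect a disconnected graph, so the sketch compares the walk length with the edge count afterwards. All names are illustrative.

```python
from collections import Counter, defaultdict


def degree_imbalance(edges):
    """Return {node: out_degree - in_degree} for a list of (src, dst) edges."""
    diff = Counter()
    for src, dst in edges:
        diff[src] += 1
        diff[dst] -= 1
    return dict(diff)


def degrees_allow_eulerian_walk(edges):
    """True if every node is balanced, or exactly two nodes are semi-balanced (+1 and -1)."""
    unbalanced = sorted(d for d in degree_imbalance(edges).values() if d != 0)
    return unbalanced in ([], [-1, 1])


def eulerian_walk(edges):
    """Hierholzer's algorithm; assumes a non-empty edge list that passes the degree check."""
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    diff = degree_imbalance(edges)
    start = next((n for n, d in diff.items() if d == 1), edges[0][0])
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if out[node]:
            stack.append(out[node].pop())
        else:
            walk.append(stack.pop())
    return walk[::-1]


def spell(walk):
    """Rebuild the sequence from a walk over (k-1)-mer nodes."""
    return walk[0] + "".join(node[-1] for node in walk[1:])


edges = [("AT", "TG"), ("TG", "GG"), ("GG", "GC")]
walk = eulerian_walk(edges)
covers_all_edges = len(walk) == len(edges) + 1  # False would indicate a disconnected graph
sequence = spell(walk)
```

When several Eulerian walks exist, this procedure returns just one of them. That ambiguity is why repeats cannot be resolved from the graph alone.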

21 de Bruijn Graph Assembly Miller et al. 2010

22 Flicek and Birney 2009

23 SPAdes
- A-Bruijn assembler
- Uses k-mers to build the initial de Bruijn graph, then forgets the k-mers
- Does multiple assemblies at different k-mer sizes to optimize the consensus DNA sequence at the end
Bankevich et al. 2012

24 ABySS (Assembly By Short Sequences)
- Distributed de Bruijn graph approach
- All possible k-mers are generated from the sequence reads
- Mate-pair information is used to extend contigs and remove ambiguities in overlaps
- Parallel sequence assembler
Simpson et al. 2009

25 ALLPATHS-LG
- De Bruijn algorithm
- Hybrid de novo genome assembler
- Requires 2 PE libraries
Gnerre et al.; Butler et al. 2008

26 Ray
- De Bruijn algorithm
- Seeds the sequences
  - A seed is more informative the more contigs overlap it
- Issues with getting FASTA results out; only getting metrics
Boisvert et al. 2010

27 Minia
- De Bruijn graph assembler
- Part of GATB (Genome Assembly Tool Box)
- Bloom filter plus an algorithm to remove false positives
- Less computational time
Chikhi and Rizk 2012; Salikhov et al. 2013
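To make the Bloom filter idea concrete, here is a minimal sketch of the data structure on its own. It stores k-mers in a fixed-size bit array: membership queries never miss an inserted k-mer but may occasionally report one that was never added. This is not Minia's code. The hashing scheme (seeded SHA-256) and the default sizes are assumptions chosen for readability, and the separate step that removes false positives is not shown.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, some false positives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
for kmer in ["ATG", "TGG", "GGC"]:
    bf.add(kmer)
all_found = all(kmer in bf for kmer in ["ATG", "TGG", "GGC"])
```

The memory cost is set by `n_bits` and does not depend on k-mer length, which is what makes the structure attractive for large k-mer sets.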

28 Literature Comparison of Assemblers
- Salzberg et al. (2012), GAGE:
  - Compared ABySS, ALLPATHS-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, and Velvet
  - Contig and scaffold N50 size as primary metric
  - ALLPATHS-LG consistently strong
- SPAdes benchmarked against EULER-SR, SOAPdenovo, Velvet, Velvet-SC, and E+V-SC; comparable to other assemblers (Bankevich et al. 2012)
- Ray benchmarked against ABySS, EULER-SR, Newbler, and Velvet; claims better results than Velvet and ABySS on Illumina data, scalable (Boisvert et al. 2010)
- Minia benchmarked against ABySS and SOAPdenovo; slightly higher N50

29 Combine Assemblies for Super Assembly
CISA: Contig Integrator for Sequence Assembly
- For bacterial genomes
- Removes uncertainties at the ends of contigs
- Minimum 30% overlap between contigs
- Uses the longest contig in a region as the representative contig (subject to further correction)
Lin and Liao 2013

30 QUAST to evaluate assemblies
- Aggregates metrics and methods from existing software
- Does not require a reference genome
- Examples of possible metrics: number of contigs, size of contigs, longest contig, total length, N50, L50
  - N50: length of the shortest contig/scaffold in the set of longest contigs/scaffolds that together cover at least half of the total assembly length
  - L50: number of contigs/scaffolds in that set
- Use a combination of these metrics to rate each assembler
  - Exact measure TBD
- Run after each individual assembler and after CISA
Gurevich et al. 2013
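N50 and L50 are easy to confuse, so a small worked sketch may help. It follows the convention QUAST uses: N50 is a length and L50 is a count. Contigs are sorted from longest to shortest and accumulated until they cover at least half of the total assembly length. The function name is illustrative and this is not QUAST's code.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths; (0, 0) for an empty list."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # `length` is the shortest contig needed to reach half the assembly (N50);
            # `count` is how many contigs it took (L50).
            return length, count
    return 0, 0


n50, l50 = n50_l50([80, 70, 50, 40, 30, 20])
```

Here the total length is 290, so half is 145. The two longest contigs (80 + 70 = 150) are enough to cover it, giving N50 = 70 and L50 = 2.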

31 Preliminary Comparisons
[Chart: QUAST comparison for de novo assemblers (paired-end): SPAdes, ABySS, Minia; axis from 0% to 100%]

32 BLAST search for reference
- After de novo assembly, we can run BLAST to find out which species/serogroup/serotype the genome belongs to
- Once we know the species/serogroup/serotype, we can use a reference genome to do reference-assisted assembly
- We can then compare de novo vs. reference-assisted assemblies to assess read quality

33 Reassemble against reference
This will allow us to:
1. Take advantage of strong reference-assisted assembly programs
2. Assess the quality of our de novo assembly

34 Wrapper
- Wrap the pipeline into a single process to streamline the assembly of many sequences
- Start by testing each step separately to ensure that each step works
- Add waypoints in the process to check how it is doing
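One possible shape for such a wrapper is sketched below: an ordered list of named steps, a waypoint recorded after each one, and an early stop on the first failure. Everything here is hypothetical. The step names and the stand-in functions are placeholders for illustration and do not call any of the real tools.

```python
def run_pipeline(steps, data):
    """Run (name, function) steps in order; record a waypoint after each step."""
    waypoints = []
    for name, step in steps:
        try:
            data = step(data)
        except Exception as exc:
            waypoints.append((name, f"failed: {exc}"))
            break  # stop at the first failing step so later steps do not run on bad input
        waypoints.append((name, "ok"))
    return data, waypoints


# Stand-in steps for illustration only; real steps would invoke the actual programs.
steps = [
    ("trim", lambda reads: [r for r in reads if len(r) >= 4]),
    ("assemble", lambda reads: ["".join(reads)]),
]
result, log = run_pipeline(steps, ["ACGT", "AC", "GGCC"])
```

Keeping each step as a separate callable matches the plan of testing every step on its own before chaining them together.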

35 References
- Baker 2012. De novo genome assembly: what every biologist should know. Nature Methods. 9:
- Bankevich et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:
- Boisvert et al. 2010. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol. 17:
- Butler et al. 2008. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18:
- Chikhi and Rizk 2012. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. WABI.
- Ekblom and Wolf 2014. A field guide to whole-genome sequencing, assembly, and annotation. Evol. Appl. 7(9):
- Flicek and Birney 2009. Sense from sequence reads: methods for alignment and assembly. Nature Methods. 6: S6-12.
- Gnerre et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci. 108:
- Gurevich et al. 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 29:
- Lin and Liao 2013. CISA: contig integrator for sequence assembly of bacterial genomes. PLoS ONE. 8(3): e
- Miller et al. 2010. Assembly algorithms for next-generation sequencing data. Genomics. 95:
- Salikhov et al. 2013. Using cascading Bloom filters to improve the memory usage for de Brujin graphs. WABI.
- Salzberg et al. 2012. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22:
- Simpson et al. 2009. ABySS: a parallel assembler for short read sequence data. Genome Res. 19:
- Zerbino, D., and Birney, E. Velvet: de novo assembly using very short reads. Genome Res. 18: