Genome Sequencing and Assembly

1 Genome Sequencing and Assembly

2 History of Sequencing: What was the first fully sequenced nucleic acid? Yeast tRNA (alanine tRNA), Robert Holley, 1965. Image: Wikipedia

3 History of Sequencing: Sequencing began with RNA, not DNA. rRNA oligomer cataloging. [Figure: panel labeled "Cellulose Acetate"; remaining labels not recoverable] Uchida et al. 1974; Woese et al. 1975

4 History of Sequencing: Sequencing Milestones
1965: First nucleic acid sequenced (yeast tRNA)
1976: First complete genome sequenced (RNA virus: bacteriophage MS2)
1977: First complete DNA genome (phage ΦX174)
1995: First complete cellular genome sequenced (Haemophilus influenzae)
1996: First eukaryotic genome sequenced (yeast)
2001: Publication of the first sequenced human genome
2016: Todos Santos Genomics and Computational Biology Workshop first offered!

5 History of Sequencing: Technological Advances
1975: Plus and minus DNA sequencing method (Sanger and Coulson)
1977: Maxam-Gilbert sequencing and Sanger DNA dideoxy terminator sequencing methods
1980s-1990s: Refinements to Sanger sequencing
  Fluorescent labeling of ddNTPs
  Capillary electrophoresis
  Automated basecalling
  Polymerase chain reaction (PCR)
2005: Introduction of 454 sequencing and the NGS revolution

6 Image: Drew Sheneman

7 Jigsaw Puzzle Genome Assembly Image: dreamstime.com

8 Assembly Algorithms
Overlap Layout Consensus (OLC), e.g., Celera
de Bruijn graph, e.g., ALLPATHS-LG, SPAdes, SOAPdenovo, Velvet

9 Overlap Layout Consensus
Overlap: Align and compare ALL pairwise combinations of reads.
  ACGTAGCTAGCATCGATCGATCGACTGATCGATCGATCGATCATC
         TAGCATCGATCGATCGACTGATCGTTCGATCGATCATCAGCATG
Layout: Build contiguous sequences (contigs) by simplifying the network of observed overlaps.
Consensus: Determine the sequence of each contig by resolving ambiguities that result from sequencing errors and/or nucleotide polymorphism.
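The overlap step can be sketched in a few lines of Python. This is only an illustration, not how Celera or any real OLC assembler computes overlaps: the function names, the `min_len` and `max_mismatches` parameters, and the toy reads are all assumptions made for the example. It compares every ordered pair of reads and reports the longest suffix-to-prefix overlap, optionally tolerating a few mismatches such as those caused by sequencing errors.

```python
def overlap_length(a, b, min_len=3, max_mismatches=0):
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases has at most
    `max_mismatches` mismatching positions.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-n:], b[:n]))
        if mismatches <= max_mismatches:
            return n
    return 0


def all_overlaps(reads, min_len=3, max_mismatches=0):
    """Compare every ordered pair of reads (the expensive all-vs-all step)."""
    found = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_length(a, b, min_len, max_mismatches)
                if n:
                    found[(i, j)] = n
    return found


# Hypothetical toy reads, not taken from the slides.
toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]
print(all_overlaps(toy_reads))
```

The nested loop makes the cost visible: the number of comparisons grows with the square of the number of reads, which is the burden the de Bruijn graph approach avoids.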

10 de Bruijn Graph (i.e., Network)
Does NOT require performing all possible pairwise comparisons between reads.
Extract all k-mers (i.e., subsequences of length k) from each read to be the nodes in the graph.
k = 5
Sequence read: ACAGGATAT
k-mers: ACAGG, CAGGA, AGGAT, GGATA, GATAT
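K-mer extraction is a sliding window over the read. A minimal Python sketch follows; the function name `kmers` is chosen here for illustration and does not come from any assembler.

```python
def kmers(read, k):
    """Return every length-k substring of `read`, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read and k from the slide.
print(kmers("ACAGGATAT", 5))
# ['ACAGG', 'CAGGA', 'AGGAT', 'GGATA', 'GATAT']
```

A read of length n yields n - k + 1 k-mers, so this 9-base read gives 5 k-mers at k = 5.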

11 de Bruijn Graph (i.e., Network)
Edges (i.e., connections) in the graph are defined between k-mers that overlap by k - 1 bases within reads.
Genome: ACAGGATATGGATACCACG
Read 1: ACAGGATATGG
Read 2:    GGATATGGATA
Read 3:         TGGATACCACG
k-mers (k = 5), in genome order:
ACAGG CAGGA AGGAT GGATA GATAT
ATATG TATGG ATGGA TGGAT GGATA
GATAC ATACC TACCA ACCAC CCACG
Graph under construction: ACAGG -> CAGGA -> AGGAT -> GGATA

13 de Bruijn Graph (i.e., Network)
Edges (i.e., connections) in the graph are defined between k-mers that overlap by k - 1 bases within reads.
Genome: ACAGGATATGGATACCACG
Read 1: ACAGGATATGG
Read 2:    GGATATGGATA
Read 3:         TGGATACCACG
Completed graph (k = 5):
Main path: ACAGG -> CAGGA -> AGGAT -> GGATA -> GATAC -> ATACC -> TACCA -> ACCAC -> CCACG
Loop through the repeated k-mer GGATA: GGATA -> GATAT -> ATATG -> TATGG -> ATGGA -> TGGAT -> GGATA
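The graph construction on this slide can be sketched in Python as follows. It uses the node-per-k-mer formulation shown in the slides, with an edge between k-mers that are adjacent within a read; the function name `build_graph` and the dict-of-sets representation are assumptions made for illustration, and real assemblers such as Velvet use far more compact data structures and also handle reverse complements and errors.

```python
def build_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = {}
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        for km in kms:
            graph.setdefault(km, set())   # register every k-mer as a node
        for left, right in zip(kms, kms[1:]):
            graph[left].add(right)        # consecutive k-mers overlap by k - 1
    return graph


# The three reads from the slide.
reads = ["ACAGGATATGG", "GGATATGGATA", "TGGATACCACG"]
graph = build_graph(reads, 5)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In the result, GGATA has two successors (GATAT and GATAC) because the k-mer occurs twice in the genome; that branch is the loop in the graph, and it shows how repeats longer than k - 1 bases create ambiguity for the assembler.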

14 Assessing Assembly Contiguity
Contig: A contiguous sequence of nucleotides produced by genome assembly.
  ACGTCATCGATGCATGCATGACGATCGTAGCATG
Scaffold: An assembled portion of a genome that can contain multiple contigs connected by structural information (e.g., paired-end reads, optical or genetic mapping) but separated by gaps.
  ACGTCATCGATGCATGCATGACGATCGTAGCATGNNNNNNNNNNACGATCGTAGCATCGATAACGT
Contig (or Scaffold) N50: The largest length such that all contigs (or scaffolds) of that length or longer together account for at least 50% of the total assembly.
Contig (or Scaffold) L50: The smallest number of contigs (or scaffolds) that together account for at least 50% of the total assembly.
Yes, N50 is a LENGTH and L50 is a NUMBER. It's meant to confuse you!
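Both statistics fall out of one pass over the contig lengths sorted from longest to shortest. The Python sketch below follows the definitions above; the function name `n50_l50` and the example lengths are hypothetical and are not taken from the workshop's Perl script or data.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # `length` is the N50 (a length); `count` is the L50 (a number).
            return length, count
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths in bp (total 290 bp, so the halfway mark is 145 bp).
print(n50_l50([80, 70, 50, 40, 30, 20]))
# (70, 2)
```

Here the two longest contigs (80 + 70 = 150 bp) are the first to reach half of the assembly, so the N50 is 70 bp and the L50 is 2.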

15 Assessing Assembly Contiguity
[Table: contigs and their lengths in bp; the individual values were not preserved] Total: 4000 bp
What is the N50 for this assembly? 400 bp
What is the L50? 3

16 Examples of Short-Read Assemblers ALLPATHS-LG: MaSuRCA: SOAPdenovo: SPAdes: Velvet:

17 PacBio Coming into its Own
Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum. Robert VanBuren, Doug Bryant, Patrick P. Edger, Haibao Tang, Diane Burgess, Dinakar Challabathula, Kristi Spittle, Richard Hall, Jenny Gu, Eric Lyons, Michael Freeling, Dorothea Bartels, Boudewijn Ten Hallers, Alex Hastie, Todd P. Michael & Todd C. Mockler
72x PacBio coverage (no Illumina data)
Assembly of a 224 Mb plant genome with contig N50 of 2.4 Mb
Image: Pacific Biosciences

18 Exercise: Endosymbiotic Bacteria in Whiteflies
Intracellular
Some obligate, some facultative
Various functional roles, including synthesizing nutrients that are lacking in a plant-sap diet
Candidatus Portiera aleyrodidarum is an ancient gammaproteobacterial endosymbiont present in all whiteflies.
Image: Gottlieb et al. 2010

19 Exercise: ~/TodosSantos/genome_assembly
Perform a Velvet de novo assembly of a bacterial genome using Illumina data
Visualize the genome assembly with Tablet
Calculate assembly summary statistics with a Perl script
Repeat the assembly with varying amounts of sequence coverage and different k-mer sizes